Some ideas #300
Replies: 3 comments 12 replies
-
@antoineeripret Thank you so much for the input. Always happy to get feedback/suggestions. My thoughts:
```python
import advertools as adv
import pandas as pd

crawldf = pd.read_json('output_file.jl', lines=True)
urldf = adv.url_to_df(crawldf['url'])

# Assign boolean flags per URL (assigning a filtered DataFrame would misalign rows)
crawldf['product'] = urldf['dir_1'].eq('producto')
crawldf['shoes'] = urldf['dir_1'].eq('producto') & urldf['dir_2'].eq('shoes')
crawldf['offers'] = urldf['url'].str.contains('oferta')
```
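A quick sanity check on the new flags (just a sketch, reusing the frames above):

```python
# The flag columns are boolean, so .sum() counts the URLs in each segment
print(crawldf[['product', 'shoes', 'offers']].sum())
```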
Or was there something else you wanted to achieve with this?
Let me know your thoughts, and we'll discuss further. Thanks again!
-
Hi @eliasdabbas, Long time no speak :) I've been using your library for some projects lately, and I wondered if it were possible to add the meta robots tag to the default elements retrieved by the spider.py file. Right now, I often add it like this:

```python
import advertools as adv

adv.crawl(
    'https://www.example.com',
    output_file='crawl_results.jl',
    follow_links=True,
    xpath_selectors={
        # meta robots tag
        'meta_robots': '//meta[@name="robots"]/@content',
    },
)
```

While I do understand that you obviously can't add every on-page element by default, this one would be pretty useful to have without writing custom code.
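For what it's worth, checking the extracted values afterwards is a one-liner (a sketch assuming the crawl above; custom selectors become columns named after their dict key):

```python
import pandas as pd

crawldf = pd.read_json('crawl_results.jl', lines=True)
print(crawldf['meta_robots'].value_counts(dropna=False))
```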
Happy to create a PR if the idea makes sense for you :) Thank you!
-
Hello @eliasdabbas, It's me again :) Do you think it would be complicated to let the user remove some default columns from advertools' default behavior?

**Situation.** Crawling a huge website (>1M URLs).

**Problem.** The crawl generates a significant amount of data for default elements that aren't always needed.

**Solution.** Modify the crawl method to let the user indicate which elements, included by default in the spider, should be removed.

I'm not sure if this is an edge case and you'd rather leave the process I described out of your logic (at the end of the day, it's easily doable), but I wanted to share it with you :) What's your take on this?
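In the meantime, this is the workaround I use (a sketch; the kept columns are examples, not a fixed list):

```python
import pandas as pd

# Read the crawl file in chunks and keep only the needed columns,
# so the full >1M-row frame never sits in memory at once
keep = ['url', 'status', 'title', 'meta_desc']  # example subset
chunks = pd.read_json('crawl_results.jl', lines=True, chunksize=100_000)
slim = pd.concat(chunk.reindex(columns=keep) for chunk in chunks)
```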
As always, thank you!
-
Hi @eliasdabbas!
Firstly, thank you for creating and updating a library that is now essential for me. I really appreciate it.
I have a couple of features that I'd love to discuss with you. I'll try to keep this comment as short as possible; let me know what you think.
**URL segmentation.** It would be great to pass a `dict` describing our architecture and get a new column using this segmentation. For instance, a dictionary could be passed as an argument to the `adv.crawl` method, as in the sketch below.
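To illustrate (the `url_segments` argument is hypothetical, not an existing advertools parameter; the second half shows how I get the same result today with plain pandas):

```python
import pandas as pd

# Hypothetical API (url_segments does NOT exist in advertools today):
# adv.crawl('https://www.example.com', 'crawl.jl',
#           url_segments={'product': 'producto', 'offers': 'oferta'})

# Same result today, after the crawl, with plain pandas:
segments = {'product': 'producto', 'offers': 'oferta'}
crawldf = pd.read_json('crawl.jl', lines=True)
crawldf['segment'] = 'other'
for name, pattern in segments.items():
    crawldf.loc[crawldf['url'].str.contains(pattern, regex=False), 'segment'] = name
```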
**Search Console data.** With a `client-secrets.json` key, we could add this information to a crawl or to a `sitemap_to_df` method call, using this library or something custom-built if you don't want to add dependencies. I'm not sure about GA, because the GA4 API is still fresh and may be updated during the upcoming months.
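As a rough illustration of the enrichment I have in mind (file and column names are assumptions based on a typical Search Console export, not an existing advertools feature):

```python
import pandas as pd

# Assumed export: one row per page with performance metrics
gsc = pd.read_csv('gsc_export.csv')  # e.g. columns: page, clicks, impressions
crawldf = pd.read_json('crawl.jl', lines=True)

# Enrich the crawl with performance data, keeping unmatched URLs
merged = crawldf.merge(gsc, left_on='url', right_on='page', how='left')
```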
**Sitemap URL columns.** I currently use the `sitemap_to_df` and `url_to_df` methods together with a `pd.merge` to get all the information I need (see the sketch below). Wouldn't it make sense to add the columns you already have in `url_to_df` when you call `sitemap_to_df`?
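A sketch of that workflow; `loc` is the URL column `sitemap_to_df` returns, and `url` the one from `url_to_df`:

```python
import advertools as adv

sitemap = adv.sitemap_to_df('https://www.example.com/sitemap.xml')
urldf = adv.url_to_df(sitemap['loc'])

# Attach the parsed URL components (dir_1, dir_2, ...) to each sitemap row
full = sitemap.merge(urldf, left_on='loc', right_on='url')
```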
Thanks in advance for your feedback!