Some ideas #300
Replies: 3 comments 12 replies
-
@antoineeripret Thank you so much for the input. Always happy to get feedback/suggestions. My thoughts:
```python
import advertools as adv
import pandas as pd

crawldf = pd.read_json('output_file.jl', lines=True)
urldf = adv.url_to_df(crawldf['url'])

# Assign boolean flags per URL (assigning a filtered DataFrame would misalign rows)
crawldf['product'] = urldf['dir_1'].eq('producto')
crawldf['shoes'] = urldf['dir_1'].eq('producto') & urldf['dir_2'].eq('shoes')
crawldf['offers'] = urldf['url'].str.contains('oferta')
```
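A quick sanity check on the new flags (just a sketch, reusing the frames above):

```python
# The flag columns are boolean, so .sum() counts the URLs in each segment
print(crawldf[['product', 'shoes', 'offers']].sum())
```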
Or was there something else you wanted to achieve with this?
Let me know your thoughts, and we'll discuss further. Thanks again!
-
Hi @eliasdabbas, Long time no speak :) I've been using your library for some projects lately, and I wondered if it were possible to add the meta robots tag to the default elements retrieved by the spider.py file. Right now, I often add it like this:

```python
import advertools as adv

adv.crawl(
    'https://www.example.com',
    output_file='crawl_results.jl',
    follow_links=True,
    xpath_selectors={
        # meta robots tag
        'meta_robots': '//meta[@name="robots"]/@content',
    },
)
```

While I do understand that you obviously can't add every on-page element by default, this one would be pretty useful to have without writing custom code.
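For what it's worth, checking the extracted values afterwards is a one-liner (a sketch assuming the crawl above; custom selectors become columns named after their dict key):

```python
import pandas as pd

crawldf = pd.read_json('crawl_results.jl', lines=True)
print(crawldf['meta_robots'].value_counts(dropna=False))
```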
Happy to create a PR if the idea makes sense for you :) Thank you!
-
Hello @eliasdabbas, It's me again :) Do you think it would be complicated to let the user remove some default columns from advertools' default behavior?

**Situation.** Crawling a huge website (>1M URLs).

**Problem.** The crawl generates a significant amount of data for default elements that aren't always needed.

**Solution.** Modify the crawl method to let the user indicate which elements, included by default in the spider, should be removed.

I'm not sure if this is an edge case and you'd rather leave the process I described out of your logic (at the end of the day, it's easily doable), but I wanted to share it with you :) What's your take on this?
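In the meantime, this is the workaround I use (a sketch; the kept columns are examples, not a fixed list):

```python
import pandas as pd

# Read the crawl file in chunks and keep only the needed columns,
# so the full >1M-row frame never sits in memory at once
keep = ['url', 'status', 'title', 'meta_desc']  # example subset
chunks = pd.read_json('crawl_results.jl', lines=True, chunksize=100_000)
slim = pd.concat(chunk.reindex(columns=keep) for chunk in chunks)
```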
As always, thank you!
-
Hi @eliasdabbas!
Firstly, thank you for creating and updating a library that is now essential for me. I really appreciate it.
I have a couple of features that I'd love to discuss with you. I'll try to keep this comment as short as possible; let me know what you think.
**URL segmentation.** It would be great to pass a `dict` describing our architecture and get a new column using this segmentation. For instance, a dictionary could be passed as an argument to the `adv.crawl` method, as in the sketch below.
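To illustrate (the `url_segments` argument is hypothetical, not an existing advertools parameter; the second half shows how I get the same result today with plain pandas):

```python
import pandas as pd

# Hypothetical API (url_segments does NOT exist in advertools today):
# adv.crawl('https://www.example.com', 'crawl.jl',
#           url_segments={'product': 'producto', 'offers': 'oferta'})

# Same result today, after the crawl, with plain pandas:
segments = {'product': 'producto', 'offers': 'oferta'}
crawldf = pd.read_json('crawl.jl', lines=True)
crawldf['segment'] = 'other'
for name, pattern in segments.items():
    crawldf.loc[crawldf['url'].str.contains(pattern, regex=False), 'segment'] = name
```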
**Search Console data.** With a `client-secrets.json` key, we could add this information to a crawl or to a `sitemap_to_df` method call, using this library or something custom-built if you don't want to add dependencies. I'm not sure about GA, because the GA4 API is still fresh and may be updated during the upcoming months.
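As a rough illustration of the enrichment I have in mind (file and column names are assumptions based on a typical Search Console export, not an existing advertools feature):

```python
import pandas as pd

# Assumed export: one row per page with performance metrics
gsc = pd.read_csv('gsc_export.csv')  # e.g. columns: page, clicks, impressions
crawldf = pd.read_json('crawl.jl', lines=True)

# Enrich the crawl with performance data, keeping unmatched URLs
merged = crawldf.merge(gsc, left_on='url', right_on='page', how='left')
```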
**Sitemap URL columns.** I currently use the `sitemap_to_df` and `url_to_df` methods together with a `pd.merge` to get all the information I need (see the sketch below). Wouldn't it make sense to add the columns you already have in `url_to_df` when you call `sitemap_to_df`?
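A sketch of that workflow; `loc` is the URL column `sitemap_to_df` returns, and `url` the one from `url_to_df`:

```python
import advertools as adv

sitemap = adv.sitemap_to_df('https://www.example.com/sitemap.xml')
urldf = adv.url_to_df(sitemap['loc'])

# Attach the parsed URL components (dir_1, dir_2, ...) to each sitemap row
full = sitemap.merge(urldf, left_on='loc', right_on='url')
```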
Thanks in advance for your feedback!