Replies: 2 comments 5 replies
-
It seems you are trying to get a subset of columns if they exist in the crawl DataFrame. You can use:

```python
columns_regex = 'h1|h2|h3|title|meta_desc|status'
filtered_df = crawl_df.filter(regex=columns_regex)
```

This gives you the columns you want if they are found; otherwise they won't be included. Is this what you want?
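For illustration, here is a runnable sketch with a made-up `crawl_df` (the column names and values are assumptions): columns matching the regex are kept, and absent ones are simply skipped without raising an error.

```python
import pandas as pd

# Hypothetical crawl result: only some of the expected columns exist.
crawl_df = pd.DataFrame({
    "url": ["https://example.com"],
    "title": ["Home"],
    "status": [200],
})

columns_regex = "h1|h2|h3|title|meta_desc|status"
filtered_df = crawl_df.filter(regex=columns_regex)
print(list(filtered_df.columns))  # ['title', 'status']
```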
4 replies
-
Thanks for the detailed explanation.
What I am trying to achieve is to save values into the database. I am trying to build an app for personal use.
The idea is:
1. A list of websites is uploaded via the UI.
2. Crawling is scheduled and done periodically.
3. All of the page elements are stored in the database.
4. All further handling/processing is done within the app using records in the db instead of CSV files.
5. Notifications are sent in case anything changes.
So, as of now we are talking about step 3, where I have created classes along with all possible tags/elements in my app.
So, I need to iterate over each element in the DataFrame and first check whether it exists or not. The code is working fine; it is just that it is intensive and potentially slow.
PS: In the above flow pandas is only used as a csv reader. Do you think it's better to use an alternative like the built-in csv module?
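Worth noting: the code shown later in this thread actually reads JSON Lines output (`pd.read_json(output_file, lines=True)`) rather than CSV, so the standard-library `json` module would be the closer pandas-free alternative. A minimal sketch with made-up data, where `dict.get()` supplies a default when a key is missing:

```python
import io
import json

# Simulated crawl output in JSON Lines format (one JSON object per line).
jl_file = io.StringIO(
    '{"url": "https://example.com", "title": "Home"}\n'
    '{"url": "https://example.com/about", "status": 200}\n'
)

rows = [json.loads(line) for line in jl_file if line.strip()]
# .get() returns a default when a key is absent, so missing
# tags/elements never raise an error.
titles = [row.get("title", "") for row in rows]
print(titles)  # ['Home', '']
```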
On Sat, 16 Apr 2022 at 10:14 PM, Elias Dabbas wrote:
Your question is clear, and the error you are getting is also clear and
logical. If you loop through things that you are not sure exist, one of
them might not be there and you will get the error.
Two things:
- This is more of a pandas discussion about DataFrames, and not so much about advertools.
- Using dot notation for column names is not advisable in pandas (even though it can be correct). If you have a column named "size" you won't get that column, because DataFrames have a size attribute, and in this case its value is 9 (9 elements in the df):
```python
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [10, 20, 30],
    'size': [22, 33, 44]
})

df.a
# 0    1
# 1    2
# 2    3
# Name: a, dtype: int64

df.size
# 9
```
In most cases you shouldn't need to use itertuples or iterrows, because pandas has "vectorized operations" that allow you to operate on all elements without having to worry about looping:
```python
df['a'] + 1
# 0    2
# 1    3
# 2    4
# Name: a, dtype: int64
```
What are you trying to achieve by iterating over the values?
1 reply
-
@eliasdabbas I am getting an error (object not found) when a column is not present in the DataFrame (e.g. when crawling random pages, you don't know whether all of the elements/tags h1, h2, ... will be returned), so I had to handle every element with if/else as below. Is there a better way to handle this situation?
```python
import pandas as pd

df = pd.read_json(output_file, lines=True)
# print(df.head(), "df.head")
df_col = df.columns
# print(df.columns, "df_column")
for row in df.itertuples(index=False):
    # print(row, "row")
    if "title" in df_col:
        page_title = row.title
    else:
        page_title = ""
    if "url" in df_col:
        page_url = row.url
    else:
        page_url = ""
    if "meta_desc" in df_col:
        page_description = row.meta_desc
    else:
        page_description = ""
    if "h1" in df_col:
        h1 = row.h1
    else:
        h1 = ""
    if "h2" in df_col:
        h2 = row.h2
    else:
        h2 = ""
    if "h3" in df_col:
        h3 = row.h3
    else:
        h3 = ""
    if "h4" in df_col:
        h4 = row.h4
    else:
        h4 = ""
    if "h5" in df_col:
        h5 = row.h5
    else:
        h5 = ""
    if "h6" in df_col:
        h6 = row.h6
    else:
        h6 = ""
    if "canonical" in df_col:
        canonical_url = row.canonical
    else:
        canonical_url = ""
    if "body_text" in df_col:
        body_text = row.body_text
    else:
        body_text = ""
    if "size" in df_col:
        body_size = row.size
    else:
        body_size = ""
    if "depth" in df_col:
        url_depth = row.depth
    else:
        url_depth = ""
    if "status" in df_col:
        url_status = row.status
    else:
        url_status = ""
    if "crawl_time" in df_col:
        crawl_time = row.crawl_time
    else:
        crawl_time = ""
    if "img_alt" in df_col:
        img_alt = row.img_alt
    else:
        img_alt = ""
    # if "img_src" in df_col:
    #     img_src = row.img_src
    # else:
    #     img_src = ""
    vals = {
        "name": page_title,
        "page_url": page_url,
        "page_title": page_title,
        "page_description": page_description,
        "h1": h1,
        "h2": h2,
        "h3": h3,
        "h4": h4,
        "h5": h5,
        "h6": h6,
        "canonical_url": canonical_url,
        "body_text": body_text,
        "body_size": body_size,
        "url_depth": url_depth,
        "url_status": url_status,
        "crawl_time": crawl_time,
        "img_alt": img_alt,
        "website_ids": self.id,
        "user_id": self.user_id.id,
        # "img_src": img_src,
    }
    self._create_landing_page(vals)
```
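As one possible answer to the question above (a suggestion, not something stated in the thread): `DataFrame.reindex` with a fixed column list creates any missing columns filled with a default value, which removes the need for the per-column if/else checks entirely. A sketch with made-up data:

```python
import pandas as pd

# Made-up crawl result where most expected columns are missing.
df = pd.DataFrame({"url": ["https://example.com"], "title": ["Home"]})

expected_cols = ["url", "title", "meta_desc", "h1", "status"]
# reindex creates every missing column, filled with fill_value,
# so each itertuples row is guaranteed to have every attribute.
df = df.reindex(columns=expected_cols, fill_value="")

row = next(df.itertuples(index=False))
print(row.meta_desc == "")  # True: the missing column now exists
```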