Replies: 2 comments 5 replies
-
It seems you are trying to get a subset of columns if they exist in the crawl DataFrame. You can use:

```python
columns_regex = 'h1|h2|h3|title|meta_desc|status'
filtered_df = crawl_df.filter(regex=columns_regex)
```

This gives you the columns you want if they are found; otherwise they won't be included. Is this what you want?
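For illustration, here is a runnable sketch with a made-up `crawl_df` (the column names and values are assumptions): columns matching the regex are kept, and absent ones are simply skipped without raising an error.

```python
import pandas as pd

# Hypothetical crawl result: only some of the expected columns exist.
crawl_df = pd.DataFrame({
    "url": ["https://example.com"],
    "title": ["Home"],
    "status": [200],
})

columns_regex = "h1|h2|h3|title|meta_desc|status"
filtered_df = crawl_df.filter(regex=columns_regex)
print(list(filtered_df.columns))  # ['title', 'status']
```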
4 replies
-
Thanks for the detailed explanation.
What I am trying to achieve is to save values into the database. I am trying to build an app for personal use.
The idea is:
1. A list of websites is uploaded via the UI.
2. Crawling is scheduled and done periodically.
3. All of the page elements are stored in the database.
4. All further handling/processing is done within the app using records in the db instead of CSV files.
5. Notifications are sent in case anything changes.
So, as of now we are talking about step 3, where I have created classes along with all possible tags/elements in my app.
So, I need to iterate over each element in the DataFrame and first check whether it exists or not. The code is working fine; it is just that it is intensive and potentially slow.
PS: In the above flow pandas is only used as a csv reader. Do you think it's better to use an alternative like the built-in csv module?
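Worth noting: the code shown later in this thread actually reads JSON Lines output (`pd.read_json(output_file, lines=True)`) rather than CSV, so the standard-library `json` module would be the closer pandas-free alternative. A minimal sketch with made-up data, where `dict.get()` supplies a default when a key is missing:

```python
import io
import json

# Simulated crawl output in JSON Lines format (one JSON object per line).
jl_file = io.StringIO(
    '{"url": "https://example.com", "title": "Home"}\n'
    '{"url": "https://example.com/about", "status": 200}\n'
)

rows = [json.loads(line) for line in jl_file if line.strip()]
# .get() returns a default when a key is absent, so missing
# tags/elements never raise an error.
titles = [row.get("title", "") for row in rows]
print(titles)  # ['Home', '']
```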
On Sat, 16 Apr 2022 at 10:14 PM, Elias Dabbas wrote:
Your question is clear, and the error you are getting is also clear and
logical. If you loop through things that you are not sure exist, one of
them might not be there and you will get the error.
Two things:
- This is more of a pandas discussion about DataFrames, and not so much about advertools.
- Using dot notation for column names is not advisable in pandas (even though it can be correct). If you have a column named "size" you won't get that column, because DataFrames have a size attribute, and in this case its value is 9 (9 elements in the df):
```python
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [10, 20, 30],
    'size': [22, 33, 44]
})

df.a
# 0    1
# 1    2
# 2    3
# Name: a, dtype: int64

df.size
# 9
```
In most cases you shouldn't need to use itertuples or iterrows, because pandas has "vectorized operations" that allow you to operate on all elements without having to worry about looping:
```python
df['a'] + 1
# 0    2
# 1    3
# 2    4
# Name: a, dtype: int64
```
What are you trying to achieve by iterating over the values?
1 reply
-
@eliasdabbas I am getting an error (object not found) when a column is not present in the DataFrame (e.g. when crawling random pages, you don't know whether all of the elements/tags h1, h2, ... will be returned), so I had to handle every element with if/else as below. Is there a better way to handle this situation?
```python
import pandas as pd

df = pd.read_json(output_file, lines=True)
# print(df.head(), "df.head")
df_col = df.columns
# print(df.columns, "df_column")
for row in df.itertuples(index=False):
    # print(row, "row")
    if "title" in df_col:
        page_title = row.title
    else:
        page_title = ""
    if "url" in df_col:
        page_url = row.url
    else:
        page_url = ""
    if "meta_desc" in df_col:
        page_description = row.meta_desc
    else:
        page_description = ""
    if "h1" in df_col:
        h1 = row.h1
    else:
        h1 = ""
    if "h2" in df_col:
        h2 = row.h2
    else:
        h2 = ""
    if "h3" in df_col:
        h3 = row.h3
    else:
        h3 = ""
    if "h4" in df_col:
        h4 = row.h4
    else:
        h4 = ""
    if "h5" in df_col:
        h5 = row.h5
    else:
        h5 = ""
    if "h6" in df_col:
        h6 = row.h6
    else:
        h6 = ""
    if "canonical" in df_col:
        canonical_url = row.canonical
    else:
        canonical_url = ""
    if "body_text" in df_col:
        body_text = row.body_text
    else:
        body_text = ""
    if "size" in df_col:
        body_size = row.size
    else:
        body_size = ""
    if "depth" in df_col:
        url_depth = row.depth
    else:
        url_depth = ""
    if "status" in df_col:
        url_status = row.status
    else:
        url_status = ""
    if "crawl_time" in df_col:
        crawl_time = row.crawl_time
    else:
        crawl_time = ""
    if "img_alt" in df_col:
        img_alt = row.img_alt
    else:
        img_alt = ""
    # if "img_src" in df_col:
    #     img_src = row.img_src
    # else:
    #     img_src = ""
    vals = {
        "name": page_title,
        "page_url": page_url,
        "page_title": page_title,
        "page_description": page_description,
        "h1": h1,
        "h2": h2,
        "h3": h3,
        "h4": h4,
        "h5": h5,
        "h6": h6,
        "canonical_url": canonical_url,
        "body_text": body_text,
        "body_size": body_size,
        "url_depth": url_depth,
        "url_status": url_status,
        "crawl_time": crawl_time,
        "img_alt": img_alt,
        "website_ids": self.id,
        "user_id": self.user_id.id,
        # "img_src": img_src,
    }
    self._create_landing_page(vals)
```
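As one possible answer to the question above (a suggestion, not something stated in the thread): `DataFrame.reindex` with a fixed column list creates any missing columns filled with a default value, which removes the need for the per-column if/else checks entirely. A sketch with made-up data:

```python
import pandas as pd

# Made-up crawl result where most expected columns are missing.
df = pd.DataFrame({"url": ["https://example.com"], "title": ["Home"]})

expected_cols = ["url", "title", "meta_desc", "h1", "status"]
# reindex creates every missing column, filled with fill_value,
# so each itertuples row is guaranteed to have every attribute.
df = df.reindex(columns=expected_cols, fill_value="")

row = next(df.itertuples(index=False))
print(row.meta_desc == "")  # True: the missing column now exists
```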