Index media, reply status, quote status and extended URL #393
- Saves mediaID, media type and tweetID in the media table
- Sets hasMedia to true if there is media. Using hasMedia and tweetID in the media table, one can retrieve information about all the media associated with a tweet
- Downloads and saves the actual media files
This is great so far.
Can you write tests for it, and include blocks of JSON as testdata? So, in a test account, you can save an API response that includes tweets with media (and later, with replies, quote tweets, links, etc.) as JSON in the testdata folder, and then write tests that load it. (Also if you can make any improvements on the testing situation, that would be great...)
I think we should capture more info in the tweet_media table too. In my test account, I posted two tweets, one with an image and one without media. Here is the tweet with an image:
Even though the tweet does not have a link, here is the text field that is pulled from the JSON: a fun game https://t.co/FuXqivJIdE
Here is a piece of the entities[].media object:
{
"display_url": "pic.x.com/FuXqivJIdE",
"expanded_url": "https://x.com/snowyfoxmatch/status/1887628076441608562/photo/1",
"id_str": "1887628033273831427",
"indices": [
11,
34
],
"media_key": "3_1887628033273831427",
"media_url_https": "https://pbs.twimg.com/media/GjIzsfdbIAMxByO.jpg",
"type": "photo",
"url": "https://t.co/FuXqivJIdE",
"ext_media_availability": {
"status": "Available"
},
Ultimately, when we repost this tweet to Bluesky or Mastodon, we will want the text to just say: a fun game and to strip the https://t.co/FuXqivJIdE link. In order to do this, we will need to save url as well. (And potentially the values in indices, as these will tell us where in the text string the link is, though we can also find that by searching the string.)
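A minimal sketch of that stripping step, assuming only the url and indices fields quoted above. The MediaEntity type and the stripMediaUrls name are hypothetical and not part of this PR; treating indices as code point offsets is also an assumption, which is why the function falls back to a plain string search when the offsets do not line up:

```typescript
// Hypothetical shape: only the fields quoted in the JSON above.
interface MediaEntity {
  url: string;               // the t.co link that appears in the text
  indices: [number, number]; // [start, end) offsets of that link in the text
}

// Remove each media link from the tweet text.
// Offsets are applied to an array of code points (an assumption about how
// they are counted); if they do not match, fall back to searching for the URL.
function stripMediaUrls(text: string, media: MediaEntity[]): string {
  let chars = Array.from(text);
  // Work from the end of the string so earlier offsets stay valid.
  const sorted = [...media].sort((a, b) => b.indices[0] - a.indices[0]);
  for (const m of sorted) {
    const [start, end] = m.indices;
    if (chars.slice(start, end).join("") === m.url) {
      chars.splice(start, end - start);
    } else {
      chars = Array.from(chars.join("").split(m.url).join(""));
    }
  }
  return chars.join("").trim();
}
```

For the test tweet above, stripMediaUrls("a fun game https://t.co/FuXqivJIdE", [{ url: "https://t.co/FuXqivJIdE", indices: [11, 34] }]) returns "a fun game". Several photos on one tweet share the same url and indices, so the second pass simply finds nothing left to remove.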
Here's the full JSON for my test tweets btw, which you could use for testing if you want:
{"data":{"user":{"result":{"__typename":"User","timeline_v2":{"timeline":{"instructions":[{"type":"TimelineClearCache"},{"type":"TimelineAddEntries","entries":[{"entryId":"tweet-1887628109589164504","sortIndex":"1887628966789906432","content":{"entryType":"TimelineTimelineItem","__typename":"TimelineTimelineItem","itemContent":{"itemType":"TimelineTweet","__typename":"TimelineTweet","tweet_results":{"result":{"__typename":"Tweet","rest_id":"1887628109589164504","core":{"user_results":{"result":{"__typename":"User","id":"VXNlcjoxODMzNjUxMDUzMTY4MDcwNjU2","rest_id":"1833651053168070656","affiliates_highlighted_label":{},"has_graduated_access":false,"parody_commentary_fan_label":"None","is_blue_verified":false,"profile_image_shape":"Circle","legacy":{"protected":true,"following":false,"can_dm":true,"can_media_tag":false,"created_at":"Tue Sep 10 23:37:56 +0000 2024","default_profile":true,"default_profile_image":false,"description":"","entities":{"description":{"urls":[]}},"fast_followers_count":0,"favourites_count":0,"followers_count":0,"friends_count":0,"has_custom_timelines":false,"is_translator":false,"listed_count":0,"location":"","media_count":1,"name":"snowyfoxmatch","needs_phone_verification":false,"normal_followers_count":0,"pinned_tweet_ids_str":[],"possibly_sensitive":false,"profile_image_url_https":"https://pbs.twimg.com/profile_images/1833667031700586496/a_XCrxOt_normal.jpg","profile_interstitial_type":"","screen_name":"snowyfoxmatch","statuses_count":2,"translator_type":"none","verified":false,"want_retweets":false,"withheld_in_countries":[]},"tipjar_settings":{}}}},"unmention_data":{},"edit_control":{"edit_tweet_ids":["1887628109589164504"],"editable_until_msecs":"1738884186000","is_edit_eligible":true,"edits_remaining":"5"},"is_translatable":false,"views":{"state":"Enabled"},"source":"<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>","grok_analysis_button":true,"legacy":{"bookmark_count":0,"bookmarked":false,"created_at":"Thu 
Feb 06 22:23:06 +0000 2025","conversation_id_str":"1887628109589164504","display_text_range":[0,20],"entities":{"hashtags":[],"symbols":[],"timestamps":[],"urls":[],"user_mentions":[]},"favorite_count":0,"favorited":false,"full_text":"a post without media","is_quote_status":false,"lang":"en","quote_count":0,"reply_count":0,"retweet_count":0,"retweeted":false,"user_id_str":"1833651053168070656","id_str":"1887628109589164504"},"quick_promote_eligibility":{"eligibility":"IneligibleNotProfessional"}}},"tweetDisplayType":"Tweet"},"clientEventInfo":{"component":"tweet","element":"tweet","details":{"timelinesDetails":{"injectionType":"RankedOrganicTweet","controllerData":"DAACDAABDAABCgABAAAAAAAAAAAKAAkZcm/q8hqQAAAAAAA="}}}}},{"entryId":"tweet-1887628076441608562","sortIndex":"1887628966789906431","content":{"entryType":"TimelineTimelineItem","__typename":"TimelineTimelineItem","itemContent":{"itemType":"TimelineTweet","__typename":"TimelineTweet","tweet_results":{"result":{"__typename":"Tweet","rest_id":"1887628076441608562","core":{"user_results":{"result":{"__typename":"User","id":"VXNlcjoxODMzNjUxMDUzMTY4MDcwNjU2","rest_id":"1833651053168070656","affiliates_highlighted_label":{},"has_graduated_access":false,"parody_commentary_fan_label":"None","is_blue_verified":false,"profile_image_shape":"Circle","legacy":{"protected":true,"following":false,"can_dm":true,"can_media_tag":false,"created_at":"Tue Sep 10 23:37:56 +0000 
2024","default_profile":true,"default_profile_image":false,"description":"","entities":{"description":{"urls":[]}},"fast_followers_count":0,"favourites_count":0,"followers_count":0,"friends_count":0,"has_custom_timelines":false,"is_translator":false,"listed_count":0,"location":"","media_count":1,"name":"snowyfoxmatch","needs_phone_verification":false,"normal_followers_count":0,"pinned_tweet_ids_str":[],"possibly_sensitive":false,"profile_image_url_https":"https://pbs.twimg.com/profile_images/1833667031700586496/a_XCrxOt_normal.jpg","profile_interstitial_type":"","screen_name":"snowyfoxmatch","statuses_count":2,"translator_type":"none","verified":false,"want_retweets":false,"withheld_in_countries":[]},"tipjar_settings":{}}}},"unmention_data":{},"edit_control":{"edit_tweet_ids":["1887628076441608562"],"editable_until_msecs":"1738884178000","is_edit_eligible":true,"edits_remaining":"5"},"is_translatable":false,"views":{"state":"Enabled"},"source":"<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>","grok_analysis_button":true,"legacy":{"bookmark_count":0,"bookmarked":false,"created_at":"Thu Feb 06 22:22:58 +0000 
2025","conversation_id_str":"1887628076441608562","display_text_range":[0,10],"entities":{"hashtags":[],"media":[{"display_url":"pic.x.com/FuXqivJIdE","expanded_url":"https://x.com/snowyfoxmatch/status/1887628076441608562/photo/1","id_str":"1887628033273831427","indices":[11,34],"media_key":"3_1887628033273831427","media_url_https":"https://pbs.twimg.com/media/GjIzsfdbIAMxByO.jpg","type":"photo","url":"https://t.co/FuXqivJIdE","ext_media_availability":{"status":"Available"},"features":{"large":{"faces":[{"x":492,"y":390,"h":34,"w":34},{"x":355,"y":141,"h":53,"w":53}]},"medium":{"faces":[{"x":492,"y":390,"h":34,"w":34},{"x":355,"y":141,"h":53,"w":53}]},"small":{"faces":[{"x":446,"y":353,"h":30,"w":30},{"x":321,"y":127,"h":48,"w":48}]},"orig":{"faces":[{"x":492,"y":390,"h":34,"w":34},{"x":355,"y":141,"h":53,"w":53}]}},"sizes":{"large":{"h":502,"w":750,"resize":"fit"},"medium":{"h":502,"w":750,"resize":"fit"},"small":{"h":455,"w":680,"resize":"fit"},"thumb":{"h":150,"w":150,"resize":"crop"}},"original_info":{"height":502,"width":750,"focus_rects":[{"x":0,"y":82,"w":750,"h":420},{"x":248,"y":0,"w":502,"h":502},{"x":310,"y":0,"w":440,"h":502},{"x":499,"y":0,"w":251,"h":502},{"x":0,"y":0,"w":750,"h":502}]},"allow_download_status":{"allow_download":true},"media_results":{"result":{"media_key":"3_1887628033273831427"}}}],"symbols":[],"timestamps":[],"urls":[],"user_mentions":[]},"extended_entities":{"media":[{"display_url":"pic.x.com/FuXqivJIdE","expanded_url":"https://x.com/snowyfoxmatch/status/1887628076441608562/photo/1","id_str":"1887628033273831427","indices":[11,34],"media_key":"3_1887628033273831427","media_url_https":"https://pbs.twimg.com/media/GjIzsfdbIAMxByO.jpg","type":"photo","url":"https://t.co/FuXqivJIdE","ext_media_availability":{"status":"Available"},"features":{"large":{"faces":[{"x":492,"y":390,"h":34,"w":34},{"x":355,"y":141,"h":53,"w":53}]},"medium":{"faces":[{"x":492,"y":390,"h":34,"w":34},{"x":355,"y":141,"h":53,"w":53}]},"small":{"faces":[{"x":446,"y
":353,"h":30,"w":30},{"x":321,"y":127,"h":48,"w":48}]},"orig":{"faces":[{"x":492,"y":390,"h":34,"w":34},{"x":355,"y":141,"h":53,"w":53}]}},"sizes":{"large":{"h":502,"w":750,"resize":"fit"},"medium":{"h":502,"w":750,"resize":"fit"},"small":{"h":455,"w":680,"resize":"fit"},"thumb":{"h":150,"w":150,"resize":"crop"}},"original_info":{"height":502,"width":750,"focus_rects":[{"x":0,"y":82,"w":750,"h":420},{"x":248,"y":0,"w":502,"h":502},{"x":310,"y":0,"w":440,"h":502},{"x":499,"y":0,"w":251,"h":502},{"x":0,"y":0,"w":750,"h":502}]},"allow_download_status":{"allow_download":true},"media_results":{"result":{"media_key":"3_1887628033273831427"}}}]},"favorite_count":0,"favorited":false,"full_text":"a fun game https://t.co/FuXqivJIdE","is_quote_status":false,"lang":"en","possibly_sensitive":false,"possibly_sensitive_editable":true,"quote_count":0,"reply_count":0,"retweet_count":0,"retweeted":false,"user_id_str":"1833651053168070656","id_str":"1887628076441608562"},"quick_promote_eligibility":{"eligibility":"IneligibleNotProfessional"}}},"tweetDisplayType":"Tweet"},"clientEventInfo":{"component":"tweet","element":"tweet","details":{"timelinesDetails":{"injectionType":"RankedOrganicTweet","controllerData":"DAACDAABDAABCgABAAAAAAAAAAAKAAkZcm/q8hqQAAAAAAA="}}}}},{"entryId":"who-to-follow-1887628966789906434","sortIndex":"1887628966789906430","content":{"entryType":"TimelineTimelineModule","__typename":"TimelineTimelineModule","items":[{"entryId":"who-to-follow-1887628966789906434-user-10671602","item":{"itemContent":{"itemType":"TimelineUser","__typename":"TimelineUser","user_results":{"result":{"__typename":"User","id":"VXNlcjoxMDY3MTYwMg==","rest_id":"10671602","affiliates_highlighted_label":{},"has_graduated_access":true,"parody_commentary_fan_label":"None","is_blue_verified":true,"profile_image_shape":"Square","legacy":{"following":false,"can_dm":false,"can_media_tag":false,"created_at":"Tue Nov 27 22:37:20 +0000 
2007","default_profile":false,"default_profile_image":false,"description":"Official Sony Interactive Entertainment account. Updates on PS5, PlayStation VR2, PlayStation Plus, PS4 and more. Support: @AskPlayStation","entities":{"description":{"urls":[]},"url":{"urls":[{"display_url":"playstation.com","expanded_url":"https://www.playstation.com","url":"https://t.co/OrydDMkek8","indices":[0,23]}]}},"fast_followers_count":0,"favourites_count":1453,"followers_count":42345013,"friends_count":752,"has_custom_timelines":true,"is_translator":false,"listed_count":34426,"location":"California","media_count":26512,"name":"PlayStation","normal_followers_count":42345013,"pinned_tweet_ids_str":["1876618946146677236"],"possibly_sensitive":false,"profile_banner_url":"https://pbs.twimg.com/profile_banners/10671602/1738705748","profile_image_url_https":"https://pbs.twimg.com/profile_images/1833447364138299392/AXIZsQe4_normal.jpg","profile_interstitial_type":"","screen_name":"PlayStation","statuses_count":46642,"translator_type":"none","url":"https://t.co/OrydDMkek8","verified":false,"verified_type":"Business","want_retweets":false,"withheld_in_countries":[]},"tipjar_settings":{}}},"userDisplayType":"User"},"clientEventInfo":{"component":"suggest_who_to_follow","element":"user","details":{"timelinesDetails":{"injectionType":"WhoToFollow","sourceData":"DAABCgABHlvecZFOhsEKAAIAAAAAAAAAAAAIAAIAAARgCAADAAAAAgA="}}}}},{"entryId":"who-to-follow-1887628966789906434-user-24742040","item":{"itemContent":{"itemType":"TimelineUser","__typename":"TimelineUser","user_results":{"result":{"__typename":"User","id":"VXNlcjoyNDc0MjA0MA==","rest_id":"24742040","affiliates_highlighted_label":{},"has_graduated_access":true,"parody_commentary_fan_label":"None","is_blue_verified":true,"profile_image_shape":"Square","legacy":{"following":false,"can_dm":false,"can_media_tag":false,"created_at":"Mon Mar 16 18:30:52 +0000 2009","default_profile":false,"default_profile_image":false,"description":"Are you telling 
me an X made this Box?","entities":{"description":{"urls":[]},"url":{"urls":[{"display_url":"Xbox.com","expanded_url":"http://Xbox.com","url":"https://t.co/C1bIH5PDZK","indices":[0,23]}]}},"fast_followers_count":0,"favourites_count":6956,"followers_count":24680499,"friends_count":15052,"has_custom_timelines":true,"is_translator":false,"listed_count":21721,"location":"Redmond, Washington","media_count":44234,"name":"Xbox","normal_followers_count":24680499,"pinned_tweet_ids_str":["1887260502482616761"],"possibly_sensitive":false,"profile_banner_url":"https://pbs.twimg.com/profile_banners/24742040/1738864839","profile_image_url_https":"https://pbs.twimg.com/profile_images/1885376956990185478/5pKp_1Ti_normal.jpg","profile_interstitial_type":"","screen_name":"Xbox","statuses_count":316697,"translator_type":"none","url":"https://t.co/C1bIH5PDZK","verified":false,"verified_type":"Business","want_retweets":false,"withheld_in_countries":[]},"tipjar_settings":{}}},"userDisplayType":"User"},"clientEventInfo":{"component":"suggest_who_to_follow","element":"user","details":{"timelinesDetails":{"injectionType":"WhoToFollow","sourceData":"DAABCgABHlvecZFOhsEKAAIAAAAAAAAAAAAIAAIAAARgCAADAAAAAgA="}}}}},{"entryId":"who-to-follow-1887628966789906434-user-11928542","item":{"itemContent":{"itemType":"TimelineUser","__typename":"TimelineUser","user_results":{"result":{"__typename":"User","id":"VXNlcjoxMTkyODU0Mg==","rest_id":"11928542","affiliates_highlighted_label":{},"has_graduated_access":true,"parody_commentary_fan_label":"None","is_blue_verified":true,"profile_image_shape":"Circle","legacy":{"following":false,"can_dm":false,"can_media_tag":false,"created_at":"Mon Jan 07 04:09:59 +0000 2008","default_profile":false,"default_profile_image":false,"description":"Play Dark Cloud 
2","entities":{"description":{"urls":[]},"url":{"urls":[{"display_url":"kotaku.com","expanded_url":"http://kotaku.com","url":"https://t.co/gMN9Ce6bJ3","indices":[0,23]}]}},"fast_followers_count":0,"favourites_count":227,"followers_count":2632714,"friends_count":199,"has_custom_timelines":true,"is_translator":false,"listed_count":12590,"location":"","media_count":83900,"name":"Kotaku","normal_followers_count":2632714,"pinned_tweet_ids_str":[],"possibly_sensitive":false,"profile_banner_url":"https://pbs.twimg.com/profile_banners/11928542/1559318957","profile_image_url_https":"https://pbs.twimg.com/profile_images/1145899315006717952/ozxwJgmx_normal.png","profile_interstitial_type":"","screen_name":"Kotaku","statuses_count":119397,"translator_type":"none","url":"https://t.co/gMN9Ce6bJ3","verified":false,"want_retweets":false,"withheld_in_countries":[]},"tipjar_settings":{}}},"userDisplayType":"User"},"clientEventInfo":{"component":"suggest_who_to_follow","element":"user","details":{"timelinesDetails":{"injectionType":"WhoToFollow","sourceData":"DAABCgABHlvecZFOhsEKAAIAAAAAAAAAAAAIAAIAAARgCAADAAAAAgA="}}}}}],"displayType":"Vertical","header":{"displayType":"Classic","text":"Who to follow","sticky":false},"footer":{"displayType":"Classic","text":"Show 
more","landingUrl":{"url":"twitter://connect_people?user_id=1833651053168070656&display_location=profile_wtf_showmore","urlType":"DeepLink"}},"clientEventInfo":{"component":"suggest_who_to_follow","details":{"timelinesDetails":{"injectionType":"WhoToFollow","sourceData":"DAABCgABHlvecZFOhsEKAAIAAAAAAAAAAAAIAAIAAARgCAADAAAAAgA="}}}}},{"entryId":"cursor-top-1887628966789906433","sortIndex":"1887628966789906433","content":{"entryType":"TimelineTimelineCursor","__typename":"TimelineTimelineCursor","value":"DAABCgABGjI0i1FAJxEKAAIaMjPDvBqx2AgAAwAAAAEAAA","cursorType":"Top"}},{"entryId":"cursor-bottom-1887628966789906429","sortIndex":"1887628966789906429","content":{"entryType":"TimelineTimelineCursor","__typename":"TimelineTimelineCursor","value":"DAABCgABGjI0i1E___sKAAIaMjO8BFshcggAAwAAAAIAAA","cursorType":"Bottom"}}]}],"metadata":{"scribeConfig":{"page":"profileBest"}}}}}}}}
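As a rough illustration of how a test could consume a saved response like this one, here is a sketch that walks the nesting visible in the JSON above (instructions, entries, itemContent, tweet_results, legacy) and collects one row per media item. The collectTweetMedia name and the TweetMedia row shape are hypothetical, not the PR's actual code:

```typescript
// Only the fields used here are typed; the rest of the response is ignored.
interface MediaEntity {
  id_str: string;
  type: string;
  url: string;
}

interface Legacy {
  id_str: string;
  entities?: { media?: MediaEntity[] };
}

// Hypothetical row shape, loosely following the columns discussed in this PR.
interface TweetMedia {
  tweetID: string;
  mediaID: string;
  mediaType: string;
  url: string;
}

// Collect one row per media item found in a timeline response.
// Entries without a tweet (cursors, "who to follow" modules) are skipped.
function collectTweetMedia(response: any): TweetMedia[] {
  const rows: TweetMedia[] = [];
  const instructions =
    response?.data?.user?.result?.timeline_v2?.timeline?.instructions ?? [];
  for (const instruction of instructions) {
    for (const entry of instruction.entries ?? []) {
      const legacy: Legacy | undefined =
        entry?.content?.itemContent?.tweet_results?.result?.legacy;
      if (!legacy) continue;
      for (const m of legacy.entities?.media ?? []) {
        rows.push({
          tweetID: legacy.id_str,
          mediaID: m.id_str,
          mediaType: m.type,
          url: m.url,
        });
      }
    }
  }
  return rows;
}
```

A test could parse the saved JSON file, pass it to this function, and assert that only the tweet with the photo produces a row.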
Yes, the media entities do have a lot of information. Do you think we should capture all of that information, like
e21caf3
to
1b36b14
Compare
I think |
1b36b14
to
b2bf23f
Compare
…to use camelCase for indexStart and indexEnd for consistency
…Ls from the db before indexing if they have previously been indexed
@micahflee I added the part that imports this information from the archive. Quote information is not present in archives, so I left that out.
Support for indexing video (and fix indexing URLs)
2998517
to
21dba91
Compare
…es media list, not the length of it. Rename start_index and end_index to startIndex and endIndex. And when importing URLs, delete the URL first if it already exists to prevent a unique constraint error.
Media in Cyd's local X archive
…t already exists to make sure we get media/links/etc
… it, and then the import fails because it is not looking in the folder it unzipped into
I just merged #402 into this branch, so now we only have one PR for this feature for you to review @redshiftzero. I fixed an existing bug where when you import an X archive from a folder (not a zip file), it would delete the original X archive. Now it should only delete the temp folder if it unzips it for you. When testing, make sure you have an X account that includes tweets with images, videos, URLs, and quote tweets. Some things to test:
@micahflee thanks for taking care of the bugs! Looks good to me.
Tested these three scenarios:
1. Build archive from scratch ✅ All looks good! Replies, video, quote tweet all good.
2. Build archive from X export ❌ One issue: videos didn't appear either in the "Tweet Media" folder or in the Cyd archive UI. This did work for scenario 1. (Note: I didn't test replies in scenario 2, as my X archive didn't have them; I only added replies for review of this PR a few hours ago.)
3. Upgrade ✅ Tested with archive created on All looks good!
// If file doesn't exist in archive, don't save information in db
if (!fs.existsSync(archiveMediaFilename)) {
    log.info(`XAccountController.saveXArchiveMedia: media file not found: ${archiveMediaFilename}`);
with the videos for loading from the export, I'm hitting this log line here
this is the log line:
22:27:32.890 › XAccountController.saveXArchiveMedia: media file not found: /Users/redshiftzero/Documents/Cyd Dev/X/$TEST_ACCOUNT_USERNAME/tmp/data/tweets_media/1891594792125010175-kvvBMojc1QwudY9d.mp4
the file in question is named data/tweets_media/1891594792125010175-jq7vI73puwizA_En.mp4 in the archive
In the media object, there are various bitrate options, and it seems like the one by default in the archive is the highest bitrate one
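One way to line the expected filename up with what the archive contains is to pick the highest-bitrate variant before building the path. The sketch below is not the fix from this PR. It assumes the variants look like the video_info.variants list in X's media JSON (url, content_type, optional bitrate), and that archive files are named `<tweetID>-<basename of the variant URL>`, which is inferred from the log line above. Both function names are hypothetical:

```typescript
// Assumed variant shape, modelled on video_info.variants in X's media JSON.
interface VideoVariant {
  url: string;
  content_type: string;
  bitrate?: number | string; // playlist (m3u8) variants carry no bitrate
}

// Return the variant with the highest numeric bitrate, if any.
function highestBitrateVariant(variants: VideoVariant[]): VideoVariant | undefined {
  let best: VideoVariant | undefined;
  let bestRate = -1;
  for (const v of variants) {
    if (v.bitrate === undefined) continue;
    const rate = Number(v.bitrate);
    if (Number.isFinite(rate) && rate > bestRate) {
      best = v;
      bestRate = rate;
    }
  }
  return best;
}

// Assumed archive naming: "<tweetID>-<basename of the variant URL>".
function archiveVideoFilename(tweetID: string, variants: VideoVariant[]): string | undefined {
  const best = highestBitrateVariant(variants);
  if (!best) return undefined;
  // Drop any query string, then keep the last path segment.
  const basename = best.url.split("?")[0].split("/").pop();
  return basename ? `${tweetID}-${basename}` : undefined;
}
```

Accepting bitrate as either a number or a string is deliberate, since the API response and the archive export may not encode it the same way.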
added 7deb52c which fixes for me
added a small commit resolving the video import from the X archive file, all LGTM now!
Refs #354
This PR is still a work in progress. It saves media to the local filesystem and creates a table to store the media key, type and tweetID, which links each media item to the correct tweet. It also updates the tweet table to have a hasMedia column.
Things left to do: