Bizarre duplication of metadata on different vods #1263
Comments
Unfortunately, because the VODs are expired I cannot do any debugging. However, several details (such as the chapter IDs) suggest the issue was probably not created by TwitchDownloader. |
For now, you will just need to fix it manually. If you notice it again while the VOD is still available, let me know and I'll look into it. |
Will do. Agreed, I don't think this is the downloader. It looks like Twitch's API having a stroke in a completely nonsensical manner. The good news is that, the way I've set up my systems now, I'll be alerted as soon as a response like this is detected. If/when it happens, I'll update this issue ASAP. |
It wouldn't be the first time Twitch's API delivered wrong or incomplete data. If this becomes a reoccurring issue, we will likely need to either change how we fetch VOD chapters or add logic to combine the extra chapters. Checking TwitchTracker, it doesn't seem to have the same issues, but it might also use one of the other API calls for fetching VOD chapters. |
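As a rough illustration of what "logic to combine the extra chapters" could look like, here is a minimal sketch that merges several chapter lists and drops entries whose chapter ID was already seen. It is not TwitchDownloader's code; the function name `merge_chapters` and the field names `id` and `startMilliseconds` are assumptions made for the example.

```python
from typing import Any


def merge_chapters(*chapter_lists: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Combine chapter lists, keeping the first entry seen for each chapter id."""
    merged: dict[str, dict[str, Any]] = {}
    for chapters in chapter_lists:
        for chapter in chapters:
            # "id" is an assumed field name; the first occurrence wins.
            merged.setdefault(chapter["id"], chapter)
    # "startMilliseconds" is an assumed field name used only for ordering.
    return sorted(merged.values(), key=lambda c: c["startMilliseconds"])


first = [{"id": "a", "startMilliseconds": 0}, {"id": "b", "startMilliseconds": 5000}]
second = [{"id": "b", "startMilliseconds": 5000}, {"id": "c", "startMilliseconds": 9000}]
combined = merge_chapters(first, second)
print([c["id"] for c in combined])
```

Keeping the first occurrence is an arbitrary choice; if two responses disagree about the same chapter ID, a real implementation would need a rule for which one to trust.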
It's not just the chapters. The VOD ID is also nonsense. That's really what I ran into first, and I noticed the chapter problem on top of it. I'm using the VOD ID as a database PK and got a file hash mismatch warning when importing, since it thought these were the same VOD, but clearly they're not. |
Oh, I completely missed that. I genuinely have no clue how that could have happened, as the ID stored in the JSON file is the exact same object that is used to fetch video info. The ID object is also never mutated, meaning this makes absolutely zero sense. My only guess would have to be that at some point the file was mutated by another program. Either that or Twitch reused this specific video ID for the exact same streamer, which seems unlikely. |
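The kind of check described above (VOD ID as primary key, plus a file hash to detect changed content) can be sketched roughly as follows. This is not the poster's importer; the table layout, the `import_vod` function, and the return labels are hypothetical and only illustrate the idea.

```python
import hashlib
import sqlite3


def file_digest(data: bytes) -> str:
    """SHA-256 hex digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


def import_vod(db: sqlite3.Connection, vod_id: str, data: bytes) -> str:
    """Return 'new', 'unchanged', or 'mismatch' for the given VOD id."""
    digest = file_digest(data)
    row = db.execute(
        "SELECT sha256 FROM vods WHERE vod_id = ?", (vod_id,)
    ).fetchone()
    if row is None:
        db.execute(
            "INSERT INTO vods (vod_id, sha256) VALUES (?, ?)", (vod_id, digest)
        )
        return "new"
    # Same primary key but different bytes: the case that raised the warning.
    return "unchanged" if row[0] == digest else "mismatch"


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE vods (vod_id TEXT PRIMARY KEY, sha256 TEXT NOT NULL)")
results = [
    import_vod(db, "2266136455", b"chat export A"),
    import_vod(db, "2266136455", b"chat export A"),
    import_vod(db, "2266136455", b"chat export B"),
]
print(results)
```

The third call reports a mismatch because the same ID arrives with different content, which is how two files claiming one VOD ID would surface during import.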
That's why I'm suspicious of the Twitch API. I'm using a modified version of https://github.com/cr08/TwitchVault to manage the data. It pulls a list of VODs for the configured user and downloads them if they don't already exist. The existence check seems to take into account the name associated with the VOD as well, which technically could lead to an issue like this if the VOD were renamed. However, that's not what I'm seeing here, because the data overlap is so bizarre. Clearly the API saw a VOD with the same ID as another at some point and downloaded the data associated with that ID. That download had a different name and therefore wasn't automatically seen as a dupe by the wrapper. It invokes the TwitchDownloader CLI using the video ID as the source and sets the output to the VOD ID concatenated with the title as reported by Twitch.
In short, the Twitch API did in fact at some point report a VOD with this kind of bizarre data. The wrapper code and TwitchDownloader hit the API independently and came to the same conclusion. I didn't link it here because it's not from this package, but the metadata reported by that wrapper agrees with what was in the chat download, and the audio file it downloaded at the time is completely different. TL;DR: I'm pretty confident Twitch's API had a stroke. At least twice.
-rw-r--r-- 1 ethan ethan 3.0M Oct 3 21:47 '20241002 T225528Z - 2266136455 - THE CRITTER IS HEARTGOLD ARTLOCKE LURK DISCORD_archive_chat.json'
-rw-r--r-- 1 ethan ethan 3.0M Oct 3 00:01 '20241002 T225528Z - 2266136455 - THE CRITTER IS MONSTER DESIGN CONTEST LURK DISCORD_archive_chat.json'
-rw-r--r-- 1 ethan ethan 3.9M Oct 23 22:16 '20241016 T225454Z - 2277847598 - THE CRITTER IS BACK TO NUZLOCKING GRIND LURK DISCORD_archive_chat.json'
-rw-r--r-- 1 ethan ethan 3.9M Oct 18 00:09 '20241016 T225454Z - 2277847598 - THE CRITTER IS ON THE COMMISSION GRIND LURK DISCORD_archive_chat.json'
Both instances here exhibit the same behavior with the chapters, start time, end time, length, link, ID, etc., but the chats are correct somehow.
Edit: I went through my data and can also confirm that the returned chat data is for the wrong VOD too. Twitch, what the hell is going on over there?
For reference, this is my importer tool output. It noticed the issue in the data. I implemented the hash in case I had to update data manually. Never thought it would end up catching this. https://gist.github.com/EthanZeigler/3d84da575c4ea3a91989beb2d3c61118 |
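Files like the ones in the listing above can be scanned for this condition directly. The sketch below groups archive filenames by the VOD ID embedded in the name and reports any ID that appears with more than one title. The regular expression is inferred from the four filenames shown and is an assumption about the naming scheme; `conflicting_ids` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Pattern inferred from the listing: "<date> T<time>Z - <vod id> - <title>_archive_chat.json"
NAME_RE = re.compile(
    r"^(?P<stamp>\d{8} T\d{6}Z) - (?P<vod_id>\d+) - (?P<title>.+)_archive_chat\.json$"
)


def conflicting_ids(filenames):
    """Map each VOD id that appears under several titles to its sorted titles."""
    titles = defaultdict(set)
    for name in filenames:
        match = NAME_RE.match(name)
        if match:
            titles[match["vod_id"]].add(match["title"])
    return {vod_id: sorted(t) for vod_id, t in titles.items() if len(t) > 1}


names = [
    "20241002 T225528Z - 2266136455 - THE CRITTER IS HEARTGOLD ARTLOCKE LURK DISCORD_archive_chat.json",
    "20241002 T225528Z - 2266136455 - THE CRITTER IS MONSTER DESIGN CONTEST LURK DISCORD_archive_chat.json",
    "20241016 T225454Z - 2277847598 - THE CRITTER IS BACK TO NUZLOCKING GRIND LURK DISCORD_archive_chat.json",
    "20241016 T225454Z - 2277847598 - THE CRITTER IS ON THE COMMISSION GRIND LURK DISCORD_archive_chat.json",
]
conflicts = conflicting_ids(names)
for vod_id, found in conflicts.items():
    print(vod_id, found)
```

Run against the four filenames from the listing, both IDs are reported, each with two different titles.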
After checking, I think I know what may be happening. The stream started on 16 September 2024 around 23:00. So what may be happening is the following: instead of having a single metadata.txt like TwitchDownloaderCLI produces, and choosing only one of the titles as the main title, you split the metadata in two and converted it to JSON, so there is one chapter in each file, each with a different title. Both JSON files are part of the same stream and have the same ID. The metadata.txt could include the game and the title in each chapter, but games and titles may not change at the same time, which makes it more complex. You can see the stream info here: https://streamscharts.com/channels/fourleafisland/streams/52006587565 The duration of each chapter is wrong in your JSON. The metadata.txt, once reconstructed, will be something like:
The game and title inside chapters have to be put in the same place because there are no other tags. The same goes for gameid and all the others if you want to add them, but the title is too long. Title changes happen a lot when the streamer starts a new stream, keeps the title from the previous one, and updates it later. |
Streamscharts encountered the same issue with the Twitch API. TwitchTracker correctly shows that they were 2 different streams: one on Sept. 13th and the other on Sept. 16th. |
Appreciate the effort! I think you've misunderstood the problem a bit though. This isn't a case of UTC rollover. That was one of the first things I checked. In fact, the API in question returns data in local time, where the stream did not cross the date barrier. This is the Twitch API returning bogus data for a stream ID. |
I guess my next question is: why does TwitchTracker have this correct? I'm trying to build a self-hosted product that does similar things to these products but with more depth. |
The stream on Sept. 13th is not related to this issue. It's a different stream that happens to have the same title as one of the chapters of the other: https://twitchtracker.com/fourleafisland/streams/44810626811 The issue you are seeing is that streams with a UTC rollover from the 16th to the 17th are displayed on both days in the TwitchTracker table, but once you open the details the stream is placed on one of them, not both. For example, on TwitchTracker Monday the 16th is missing and the stream is placed on Sept. 17th: https://twitchtracker.com/fourleafisland/streams/52006587565 Streamscharts places it on the 16th, with the same ID 52006587565: https://streamscharts.com/channels/fourleafisland/streams/52006587565 One places it on the day it starts and the other on the day it ends, but the only important time is the "created_at". |
The ID 2252936436 refers to the same broadcast as 52006587565: the former is the VOD ID and the latter is the live stream ID. There's a Reddit post about it. |
I don't understand exactly which data is bogus. Could you tell me, please? |
I think you are not understanding it correctly. In both cases this is the same date and ID:
That's because it's exactly the same video, but there is one JSON for each chapter. The other date you see is when you created the JSON:
|
It created one JSON for each chapter, even though each file contains the full chapter list. Below you can see the chapters are repeated exactly in both cases, so that is redundant:
|
If you want to use trackers to do very accurate and in-depth things, forget about it. They are not as reliable as catching the streams while they are live on Twitch or in the VODs section. They are just for double-checking; in most cases they are OK. |
The start and end times are the same in both JSON files; it's not calculating the corresponding times for each chapter:
The program you are using to create those JSON files does not work well. If it's TwitchVault, it hasn't been updated since Oct 19, 2023. |
@superbonaci the program that's making those JSON files... is TwitchDownloader. TwitchVault is a wrapper around TwitchDownloader that automatically runs the download for unsaved VODs, clips, chats, etc. Any other generation related to TwitchVault is storing the raw information returned from the Twitch GraphQL API directly, in the exact same manner this project does. See https://github.com/cr08/TwitchVault/blob/c5eda2d5e3d4bf6a6f0b9801f322916dfe143b05/videos.py#L150-L246. It just runs TwitchDownloader via a shell command. Chapters are only handled within TwitchDownloader. Vault, for VOD and chat downloads (I'm not using clips), doesn't care about them. It's just reading the same GraphQL API.
TwitchVault only stores basic metadata from the GraphQL API. The way I'm using it means it doesn't need anything more. https://github.com/cr08/TwitchVault/blob/c5eda2d5e3d4bf6a6f0b9801f322916dfe143b05/utils.py#L84 |
TL;DR: this is on Twitch and probably not something either project can do anything about. I made this issue less for "downloader is messing up" and more for "be aware this is a thing Twitch's API just decides to do". |
But metadata.txt has the correct information. The only thing missing is that when there are title changes, it uses only one of them; I'm not sure which one. |
It looks like a lot of things get mixed up in the JSON files. |
The chat json metadata files are prepared almost identically to the ffmpeg metadata files. I can assure you that this is not a problem with how TD handles data from Twitch. |
I have another example of bizarre behavior now, from November 30th, but I need to double-check that what we're seeing is the same issue before sharing. |
Yes, if you are going to share it, make sure it's a VOD that we can download and check. |
This is the metadata.txt:
The only difference I see is that he changed the title of the stream 4 times. The last one is the current one on Twitch, and maybe it was changed after the stream ended: https://streamscharts.com/channels/fourleafisland/streams/43477083080
You should not save the stream under the title or chapter (game), because they can be changed several times during the stream and also after the stream. Why do you bother doing it that way? Save with the video ID or something; do not save by chapters unless you are sure there are no bugs. Still, I don't see what your issue is. |
I don't know if there can be any bug when the title and game don't change at the same time, for example: 0 - start with one title and one game The metadata.txt will only have one title and the 2 games (the chapters), but if you try to be very precise and log everything, some bugs may happen. I don't know if the Twitch API can be that accurate; all that information should be available somehow after the stream ends, maybe @ScrubN knows... @EthanZeigler What is it that you want to do exactly? What do you find wrong? |
It's working fine now. I think you just need to wait until the stream ends and Twitch has correctly processed the video. Any edits made after the stream ends should be reflected as well.
|
Interesting. That gives me a theory on what might be happening then. Is it possible the automation for finding the channel IDs is running into a race condition with the VOD finalization process? Will experiment and report back in the future. |
You mean the game ID, which is always the same for each game? The channel and the user have their own IDs, but that's a different thing.
Edit: it looks like the chapters also have their own ID, which should not be duplicated:
Each VOD has its own ID, and so does each live stream, which is not the same as the VOD's. There are so many types of ID that you must be sure what you are looking for. I'm not sure if that's even documented. |
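If the race-condition theory holds, one cautious approach is to avoid trusting a single response and only accept metadata once two consecutive polls agree. The sketch below shows that idea in isolation. `fetch_when_stable` and the `fetch` callable are hypothetical stand-ins for whatever actually queries Twitch, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable, Optional


def fetch_when_stable(
    fetch: Callable[[], dict],
    attempts: int = 5,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Return metadata once two consecutive polls agree, else None."""
    previous = fetch()
    for _ in range(attempts - 1):
        sleep(delay_seconds)
        current = fetch()
        if current == previous:
            return current
        previous = current
    return None


# Simulated responses: the first poll is inconsistent with the later ones.
responses = iter([
    {"id": "123", "title": "xyz"},
    {"id": "123", "title": "abc"},
    {"id": "123", "title": "abc"},
])
stable = fetch_when_stable(lambda: next(responses), sleep=lambda _s: None)
print(stable)
```

This only reduces the chance of acting on a transient response. If the API returns the same wrong data on two polls in a row, the check passes anyway.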
I think the duration of 0:
is because you are reading the info while the stream is live. Check here:
|
Okay, I've confirmed it. It's a race condition bug within the Twitch API. Between when a stream has ended and when its VOD is finalized, the API returns incorrect information for the previous VOD. This leads the script I'm using to identify the already downloaded VOD as a different one than previously downloaded, since the title and first chapter change. It redownloads that VOD's data and stores it under the incorrect name returned by the API. For example, if the just finished stream is going to be assigned id 345 and the last recorded and available vod is
In a more distilled form:

Stream running
    vods = api.get_vods()
    -> [123]
    for vod in vods:
        if vod <timestamp>_<id>_<title> exists:
            print("vod {vod} exists")
            continue
        else:
            print("vod {vod} downloading")
            api.download_vod(vod)
    -> "vod 123|abc exists"

Stream ended, vod finalized
    vods = api.get_vods()
    -> [123, 345]
    for vod in vods:
        if vod <timestamp>_<id>_<title> exists:
            print("vod {vod} exists")
            continue
        else:
            print("vod {vod} downloading")
            api.download_vod(vod)
    -> "vod 123|abc exists\nvod 345|cde downloading"

Stream ended, vod not finalized (bug)
    vods = api.get_vods()
    -> [123]
    for vod in vods:
        if vod <timestamp>_<id>_<title> exists:
            print("vod {vod} exists")
            continue
        else:
            print("vod {vod} downloading")
            api.download_vod(vod)
    -> "vod 123|xyz downloading"

As far as TwitchDownloader is concerned, I don't know how one works around this. Without knowing what the VOD title should be, there's no way I can think of to verify the returned data is correct. |
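On the wrapper side, one possible mitigation is to key the "already downloaded" check on the VOD ID alone and to treat a changed title for a known ID as suspicious instead of as a new VOD. The sketch below illustrates this with plain data structures. `plan_downloads`, the `(vod_id, title)` tuples, and the `archive` mapping are hypothetical and not part of TwitchVault or TwitchDownloader.

```python
def plan_downloads(listed, archive):
    """Decide what to do with each listed VOD.

    listed:  iterable of (vod_id, title) pairs as reported by the API.
    archive: dict mapping already-saved vod_id -> title it was saved under.
    Returns (to_download, suspicious).
    """
    to_download, suspicious = [], []
    for vod_id, title in listed:
        if vod_id not in archive:
            to_download.append(vod_id)
        elif archive[vod_id] != title:
            # Known id, different title: flag it instead of downloading again.
            suspicious.append((vod_id, archive[vod_id], title))
    return to_download, suspicious


archive = {"123": "abc"}

# Stream ended, VOD finalized: the new VOD is queued, the old one is skipped.
normal = plan_downloads([("123", "abc"), ("345", "cde")], archive)

# Stream ended, VOD not finalized: the old id comes back with the wrong title.
buggy = plan_downloads([("123", "xyz")], archive)

print(normal)
print(buggy)
```

This does not verify that the returned data is correct. It only prevents the bad response from being stored as a second copy, and it would also flag a legitimate rename, so flagged entries still need a manual look or a later recheck.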
What is in the metadata.txt that TwitchDownloaderCLI returns? Isn't that good enough? |
Checklist
Edition
Command Line Interface
Describe your issue here
I'm not sure how to describe this other than just showing.
I have been using this tool to download VODs from a friend of mine for months and inserting everything into a searchable database. That isn't important for the bug but will make explaining the circumstances easier. I was using the latest version at the time of the bugs. The issues occurred in September and October.
Over this time period I've accumulated over 200 VOD chats, and while implementing file hashing to avoid reprocessing data, I ran into this: 2 VOD downloads where the metadata appears to have been merged and mangled in a way that makes absolutely no sense. For the record, these are different VODs. They did not occur on the same day at the same time. They should be completely distinct. And yet, they're not?
I'm aware this is likely a bug on Twitch's end, but it's so unusual I felt I needed to put it here in case anyone else has run into the same behavior. The affected VODs are from September and October of 2024. For privacy reasons I've redacted the name and ID of the streamer but can provide them privately if requested. Or you can just look up the VOD ID if it's floating around, but I wanted to avoid search engines indexing the name.
Does anyone know what on earth happened here???
Add any related files or extra information here
No response