merge task-1 to main
tedoaba committed Oct 11, 2024
2 parents 51f25fb + 871fbaf commit ad9b980
Showing 41 changed files with 33,334 additions and 1 deletion.
2 changes: 1 addition & 1 deletion .gitignore
@@ -10,7 +10,7 @@ __pycache__/
.week7/

# Data
data/


# Distribution / packaging
.Python
33,168 changes: 33,168 additions & 0 deletions data/telegram_medical_businesses_data.csv

Large diffs are not rendered by default.

Binary file added images/photo_2022-12-20_07-41-33.jpg
Binary file added images/photo_2022-12-20_11-28-59.jpg
Binary file added images/photo_2022-12-20_17-25-05.jpg
Binary file added images/photo_2022-12-21_15-13-27.jpg
Binary file added images/photo_2022-12-22_03-11-26.jpg
Binary file added images/photo_2022-12-22_06-40-25.jpg
Binary file added images/photo_2022-12-23_06-26-15.jpg
Binary file added images/photo_2022-12-27_17-06-32.jpg
Binary file added images/photo_2022-12-28_06-31-50.jpg
Binary file added images/photo_2022-12-28_17-02-08.jpg
Binary file added images/photo_2022-12-30_15-45-35.jpg
Binary file added images/photo_2023-01-02_07-02-55.jpg
Binary file added images/photo_2023-01-03_05-48-34.jpg
Binary file added images/photo_2023-01-03_17-49-48.jpg
Binary file added images/photo_2023-01-04_05-58-02.jpg
Binary file added images/photo_2023-01-06_06-05-01.jpg
Binary file added images/photo_2023-01-06_09-31-17.jpg
Binary file added images/photo_2023-01-06_16-06-21.jpg
Binary file added images/photo_2023-01-13_09-44-14.jpg
Binary file added images/photo_2023-01-13_12-48-47.jpg
Binary file added images/photo_2023-01-16_09-26-09.jpg
Binary file added images/photo_2023-01-16_10-13-42.jpg
Binary file added images/photo_2023-01-16_13-41-35.jpg
Binary file added images/photo_2023-01-17_08-43-12.jpg
Binary file added images/photo_2023-01-23_10-39-20.jpg
Binary file added images/photo_2023-01-26_18-27-53.jpg
Binary file added images/photo_2023-01-27_07-18-40.jpg
Binary file added images/photo_2023-01-30_09-45-25.jpg
Binary file added images/photo_2023-01-31_09-19-53.jpg
Binary file added images/photo_2023-02-01_08-59-37.jpg
Binary file added images/photo_2023-02-02_08-58-52.jpg
Binary file added images/photo_2023-02-10_12-23-06.jpg
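The image names added in this commit all follow a `photo_YYYY-MM-DD_HH-MM-SS.jpg` pattern. The following is a minimal sketch, assuming that pattern holds for every file, of recovering a timestamp from a name so images can later be matched against message dates. `parse_photo_timestamp` is a hypothetical helper that is not part of this commit, and the commit does not state which time zone the stamps use, so the result is a naive `datetime`.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def parse_photo_timestamp(file_name: str) -> Optional[datetime]:
    """Return the timestamp encoded in a 'photo_<date>_<time>.jpg' name.

    Hypothetical helper: returns None when the name does not match the
    assumed pattern instead of raising.
    """
    stem = Path(file_name).stem  # e.g. 'photo_2022-12-20_07-41-33'
    prefix = "photo_"
    if not stem.startswith(prefix):
        return None
    try:
        return datetime.strptime(stem[len(prefix):], "%Y-%m-%d_%H-%M-%S")
    except ValueError:
        # Name starts with 'photo_' but the rest is not a valid stamp.
        return None


stamp = parse_photo_timestamp("images/photo_2022-12-20_07-41-33.jpg")
```

Names that deviate from the pattern, such as duplicates with a numeric suffix, yield `None` and can be handled separately.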
5 changes: 5 additions & 0 deletions requirements.txt
@@ -0,0 +1,5 @@
pandas
numpy
matplotlib
seaborn
telethon
20 changes: 20 additions & 0 deletions scripts/main.py
@@ -0,0 +1,20 @@
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
sys.path.append(os.path.abspath('../src'))

from data_loader import load_data

def main():
    file_path = '../data/telegram_medical_businesses_data.csv'
    df = load_data(file_path=file_path)
    print(df.head())
    print(df.shape)
    print(df.isnull().sum())


if __name__ == '__main__':
    main()
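`scripts/main.py` uses `'../src'` and `'../data/...'`, which resolve against the current working directory, so the script only works when launched from inside `scripts/`. The following is a minimal sketch, assuming the repository layout shown in this commit (`scripts/`, `src/`, and `data/` as siblings), of deriving both paths from the script's own location instead. `project_paths` and `add_src_to_path` are hypothetical helpers, not part of the commit.

```python
import sys
from pathlib import Path

CSV_NAME = "telegram_medical_businesses_data.csv"


def project_paths(script_file):
    """Return (src_dir, csv_path) for a script located in <repo>/scripts/."""
    # Assumption: the script sits exactly one level below the repo root.
    repo_root = Path(script_file).resolve().parent.parent
    return repo_root / "src", repo_root / "data" / CSV_NAME


def add_src_to_path(src_dir):
    """Append src_dir to sys.path once so 'data_loader' becomes importable."""
    entry = str(src_dir)
    if entry not in sys.path:
        sys.path.append(entry)


# Inside scripts/main.py this would be called as project_paths(__file__);
# a literal path is used here so the sketch runs anywhere.
src_dir, csv_path = project_paths("scripts/main.py")
```

With this in place the script can be started from any directory, at the cost of hard-coding the assumption about where `scripts/` sits in the repository.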
9 changes: 9 additions & 0 deletions src/data_loader.py
@@ -0,0 +1,9 @@
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def load_data(file_path):
    df = pd.read_csv(file_path)
    return df
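`load_data` returns the CSV as read, so the `date` column written by `src/info_scraper.py` comes back as plain strings. The following is a minimal sketch, assuming the four columns that the scraper writes (`message_id`, `date`, `text`, `sender_id`), of a variant that also parses the dates. The `date_column` parameter and the two sample rows are made up for illustration and do not come from the real data file.

```python
import io

import pandas as pd


def load_data(file_path_or_buffer, date_column="date"):
    """Read the scraped CSV and parse its date column when present."""
    df = pd.read_csv(file_path_or_buffer)
    if date_column in df.columns:
        # Unparseable values become NaT instead of raising.
        df[date_column] = pd.to_datetime(
            df[date_column], utc=True, errors="coerce"
        )
    return df


# Illustrative in-memory CSV; the second row has an empty 'text' field.
sample = io.StringIO(
    "message_id,date,text,sender_id\n"
    "1,2023-01-02 07:02:55+00:00,hello,-100123\n"
    "2,2023-01-03 05:48:34+00:00,,-100123\n"
)
df = load_data(sample)
missing_text = int(df["text"].isnull().sum())
```

Parsing at load time keeps the checks in `scripts/main.py` unchanged while making date filtering possible later.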
98 changes: 98 additions & 0 deletions src/info_scraper.py
@@ -0,0 +1,98 @@
import pandas as pd
from telethon import TelegramClient
from telethon.errors import SessionPasswordNeededError
from telethon.tl.functions.messages import GetHistoryRequest
from telethon.tl.types import PeerChannel
import asyncio

# Step 1: Define Telegram API credentials
API_ID = ''
API_HASH = ''
PHONE_NUMBER = ''
# Step 2: Create a function to connect to the Telegram client
client = TelegramClient('session_name', API_ID, API_HASH)

async def connect_telegram():
    await client.start(PHONE_NUMBER)
    if not await client.is_user_authorized():
        try:
            await client.send_code_request(PHONE_NUMBER)
            await client.sign_in(PHONE_NUMBER, input('Enter the code: '))
        except SessionPasswordNeededError:
            await client.sign_in(password=input('Enter your 2FA password: '))

# Step 3: Scraping messages from specific Telegram channels
async def scrape_channel_messages(channel_username, limit=10000):
    """
    Scrapes messages from a specific Telegram channel.
    Args:
        channel_username (str): The username or URL of the Telegram channel.
        limit (int): The maximum number of messages to scrape.
    Returns:
        DataFrame: A pandas DataFrame containing the scraped messages.
    """
    try:
        channel = await client.get_entity(PeerChannel(int(channel_username)) if channel_username.isdigit() else channel_username)
    except Exception as e:
        print(f"Could not access channel {channel_username}: {e}")
        return pd.DataFrame()

    messages = []
    async for message in client.iter_messages(channel, limit=limit):
        messages.append({
            'message_id': message.id,
            'date': message.date,
            'text': message.message,
            'sender_id': message.sender_id
        })

    return pd.DataFrame(messages)

# Step 4: Scraping data from multiple channels
async def scrape_multiple_channels(channel_list, message_limit=1000):
    """
    Scrapes data from a list of Telegram channels and aggregates it into a single DataFrame.
    Args:
        channel_list (list): List of Telegram channel usernames or URLs.
        message_limit (int): The number of messages to scrape per channel.
    Returns:
        DataFrame: A pandas DataFrame containing all the scraped messages.
    """
    all_data = pd.DataFrame()

    for channel in channel_list:
        print(f"Scraping messages from {channel}...")
        channel_data = await scrape_channel_messages(channel, limit=message_limit)
        all_data = pd.concat([all_data, channel_data], ignore_index=True)

    return all_data

# Step 5: Define the main function to run the pipeline
async def main():
    # List of Telegram channels
    channel_list = [
        '@DoctorsET',
        '@CheMed123',
        '@lobelia4cosmetics',
        '@yetenaweg',
        '@EAHCI'
    ]

    # Connect to Telegram
    await connect_telegram()

    # Scrape messages from the channels
    scraped_data = await scrape_multiple_channels(channel_list)

    # Save scraped data to CSV for later processing
    scraped_data.to_csv('../data/telegram_medical_businesses_data.csv', index=False)
    print("Data scraping complete. Data saved to 'telegram_medical_businesses_data.csv'.")

# Step 6: Start the script
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
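`API_ID`, `API_HASH`, and `PHONE_NUMBER` are committed as empty strings, so they have to be filled in by hand before `src/info_scraper.py` can run. The following is a minimal sketch, assuming the values are supplied through environment variables, of loading them without editing the source. The variable names `TELEGRAM_API_ID`, `TELEGRAM_API_HASH`, and `TELEGRAM_PHONE` and the helper `load_credentials` are hypothetical and are not defined anywhere in this commit.

```python
import os


def load_credentials(env=None):
    """Return (api_id, api_hash, phone) read from environment variables.

    Hypothetical helper: the variable names below are assumptions.
    Raises RuntimeError listing every variable that is missing or empty.
    """
    env = os.environ if env is None else env
    names = ("TELEGRAM_API_ID", "TELEGRAM_API_HASH", "TELEGRAM_PHONE")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    # The API ID is numeric, so convert it before handing it to the client.
    return (
        int(env["TELEGRAM_API_ID"]),
        env["TELEGRAM_API_HASH"],
        env["TELEGRAM_PHONE"],
    )


# Illustrative values only; real credentials come from my.telegram.org.
demo_env = {
    "TELEGRAM_API_ID": "12345",
    "TELEGRAM_API_HASH": "abc",
    "TELEGRAM_PHONE": "+10000000000",
}
api_id, api_hash, phone = load_credentials(demo_env)
```

The returned values could then be passed to `TelegramClient('session_name', api_id, api_hash)` and `client.start(phone)` in place of the module-level constants.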
33 changes: 33 additions & 0 deletions src/scraper.py
@@ -0,0 +1,33 @@
from telethon import TelegramClient, sync
import os

# Your API ID and API Hash (from my.telegram.org)

api_id = ''
api_hash = ''

# Define the client (replace 'session_name' with any name)
client = TelegramClient('session_name', api_id, api_hash)

# Connect to Telegram
client.start()

# Define the channel username or ID (You can use ID or '@channelusername')
channel = '@CheMed123'

# Folder to save the images
image_save_path = '../images/'
os.makedirs(image_save_path, exist_ok=True)

# Fetch messages from the channel
async def scrape_channel():
    async for message in client.iter_messages(channel):
        # Check if the message contains media (image)
        if message.photo:
            # Download the photo
            file_path = await message.download_media(file=image_save_path)
            print(f"Image saved to {file_path}")

# Run the scraping
with client:
    client.loop.run_until_complete(scrape_channel())
Binary file added src/session_name.session
Binary file not shown.
