A powerful Python script that lets you scrape messages and media from Telegram channels using the Telethon library. Features include real-time continuous scraping, media downloading, and data export capabilities.
Features 🚀
- Scrape messages from multiple Telegram channels
- Download media files (photos, documents)
- Real-time continuous scraping
- Export data to JSON and CSV formats
- SQLite database storage
- Resume capability (saves progress)
- Media reprocessing for failed downloads
- Progress monitoring
- Interactive menu interface
Prerequisites 📋
Before running the script, you'll need:
- Python 3.7 or higher
- Telegram account
- API credentials from Telegram
Required Python packages
pip install -r requirements.txt
Contents of requirements.txt:
telethon
aiohttp
asyncio
Getting Telegram API Credentials 🔑
- Go to https://my.telegram.org/auth
- Log in with your phone number
- Click on “API development tools”
- Fill in the form:
  - App title: Your app name
  - Short name: Your app short name
  - Platform: Can be left as “Desktop”
  - Description: Brief description of your app
- Click “Create application”
- You'll receive:
  - api_id: A number
  - api_hash: A string of letters and numbers
Keep these credentials safe; you'll need them to run the script!
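One way to keep the credentials out of source control is to read them from environment variables. The sketch below is only an illustration: the variable names `TELEGRAM_API_ID` and `TELEGRAM_API_HASH` and the helper `load_credentials` are assumptions made for this example, not something the script defines (the script prompts for the values on first run).

```python
import os


def load_credentials(env=os.environ):
    """Read Telegram API credentials from a mapping of environment variables.

    TELEGRAM_API_ID and TELEGRAM_API_HASH are hypothetical names chosen for
    this example; the script itself asks for the values interactively.
    """
    api_id = env.get("TELEGRAM_API_ID", "").strip()
    api_hash = env.get("TELEGRAM_API_HASH", "").strip()

    # api_id is a number, api_hash is a string of letters and numbers.
    if not api_id.isdigit():
        raise ValueError("api_id must be a number")
    if not api_hash.isalnum():
        raise ValueError("api_hash must be a string of letters and numbers")

    return int(api_id), api_hash
```

Calling `load_credentials()` with no arguments reads from the real environment; passing a plain dict makes the function easy to try out.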
Setup and Running 🔧
- Clone the repository:
git clone https://github.com/unnohwn/telegram-scraper.git
cd telegram-scraper
- Install requirements:
pip install -r requirements.txt
- Run the script:
python telegram-scraper.py
- On first run, you'll be prompted to enter:
  - Your API ID
  - Your API Hash
  - Your phone number (with country code)
  - Your phone number (with country code) or bot, but use the phone number option when prompted the second time.
  - Verification code (sent to your Telegram)
Initial Scraping Behavior 🕒
When scraping a channel for the first time, please note:
- The script will attempt to retrieve the entire channel history, starting from the oldest messages
- Initial scraping can take several minutes or even hours, depending on:
  - The total number of messages in the channel
  - Whether media downloading is enabled
  - The size and number of media files
  - Your internet connection speed
  - Telegram’s rate limiting
- The script uses pagination and maintains state, so if interrupted, it can resume from where it left off
- Progress percentage is displayed in real time to track the scraping status
- Messages are stored in the database as they’re scraped, so you can start analyzing available data even before the scraping is complete
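The resume behavior described above can be pictured as a loop that fetches one page at a time and records the last message ID after each page. This is a minimal sketch under stated assumptions, not the script's actual code: the `scrape_state` table, the helper names, and the `fetch_page(after_id, limit)` callback are all hypothetical, and the callback stands in for a real Telegram request.

```python
import sqlite3


def get_last_id(conn):
    """Return the last saved message ID, or 0 if nothing was scraped yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scrape_state (key TEXT PRIMARY KEY, value INTEGER)"
    )
    row = conn.execute(
        "SELECT value FROM scrape_state WHERE key = 'last_message_id'"
    ).fetchone()
    return row[0] if row else 0


def save_last_id(conn, message_id):
    """Persist progress so an interrupted run can pick up from here."""
    conn.execute(
        "INSERT OR REPLACE INTO scrape_state (key, value) VALUES ('last_message_id', ?)",
        (message_id,),
    )
    conn.commit()


def scrape(conn, fetch_page, page_size=100):
    """Fetch pages oldest-first, starting after the last saved message ID.

    fetch_page(after_id, limit) is a hypothetical callback returning message
    IDs greater than after_id in ascending order.
    """
    last_id = get_last_id(conn)
    scraped = []
    while True:
        page = fetch_page(last_id, page_size)
        if not page:
            break
        scraped.extend(page)
        last_id = page[-1]
        save_last_id(conn, last_id)  # checkpoint after every page
    return scraped
```

Because the checkpoint is written after every page, an interruption loses at most one page of work, and a later call to `scrape` only requests messages newer than the saved ID.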
Usage 📝
The script provides an interactive menu with the following options:
- [A] Add new channel
  - Enter the channel ID or channelname
- [R] Remove channel
  - Remove a channel from the scraping list
- [S] Scrape all channels
  - One-time scraping of all configured channels
- [M] Toggle media scraping
  - Enable/disable downloading of media files
- [C] Continuous scraping
  - Real-time monitoring of channels for new messages
- [E] Export data
  - Export to JSON and CSV formats
- [V] View saved channels
  - List all saved channels
- [L] List account channels
  - List all channels with IDs for the account
- [Q] Quit
Channel IDs 📢
You can use either:
- Channel username (e.g., channelname)
- Channel ID (e.g., -1001234567890)
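To illustrate the two accepted forms, here is a small sketch of how an input string could be told apart: an optional minus sign followed by digits is treated as a numeric channel ID, and anything else as a username. The function name `parse_channel` and the stripping of a leading `@` are assumptions for this example, not behavior confirmed by the script.

```python
import re


def parse_channel(value):
    """Return an int for a numeric channel ID, otherwise a username string."""
    text = value.strip()
    # Numeric IDs may be negative, e.g. -1001234567890.
    if re.fullmatch(r"-?\d+", text):
        return int(text)
    # Assumption: tolerate a leading "@" on usernames.
    return text.lstrip("@")
```

For example, `parse_channel("-1001234567890")` yields an integer, while `parse_channel("channelname")` is returned unchanged as a string.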
Data Storage 💾
Database Structure
Data is stored in SQLite databases, one per channel:
- Location: ./channelname/channelname.db
- Table: messages
  - id: Primary key
  - message_id: Telegram message ID
  - date: Message timestamp
  - sender_id: Sender’s Telegram ID
  - first_name: Sender’s first name
  - last_name: Sender’s last name
  - username: Sender’s username
  - message: Message text
  - media_type: Type of media (if any)
  - media_path: Local path to downloaded media
  - reply_to: ID of replied message (if any)
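Since messages are available while scraping is still running, it can be handy to query the database directly. The sketch below recreates the table layout listed above and runs one query against it. The column names come from the list; the column types, the `AUTOINCREMENT` key, and the helper names `open_db` and `messages_with_media` are assumptions, so check them against an actual database file before relying on them.

```python
import sqlite3

# Column names follow the README; the types are assumed for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    message_id INTEGER,
    date TEXT,
    sender_id INTEGER,
    first_name TEXT,
    last_name TEXT,
    username TEXT,
    message TEXT,
    media_type TEXT,
    media_path TEXT,
    reply_to INTEGER
)
"""


def open_db(path):
    """Open a channel database (e.g. ./channelname/channelname.db)."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # allow access by column name
    conn.execute(SCHEMA)
    return conn


def messages_with_media(conn):
    """Return rows for messages that have a media type recorded."""
    return conn.execute(
        "SELECT message_id, media_type, media_path FROM messages "
        "WHERE media_type IS NOT NULL ORDER BY message_id"
    ).fetchall()
```

Opening the database with `":memory:"` instead of a file path gives a throwaway copy of the layout for experimenting with queries.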
Media Storage 📁
Media files are stored in:
- Location: ./channelname/media/
- Files are named using message ID or original filename
Exported Data 📊
Data can be exported in two formats:
- CSV: ./channelname/channelname.csv
  - Human-readable spreadsheet format
  - Easy to import into Excel/Google Sheets
- JSON: ./channelname/channelname.json
  - Structured data format
  - Ideal for programmatic processing
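For readers who want to produce the same two formats themselves, an export can be sketched with the standard library alone. This is not the script's own export code: the function name `export_messages` is hypothetical, and it assumes only that the database has a `messages` table with a `date` column.

```python
import csv
import json
import sqlite3


def export_messages(conn, csv_path, json_path):
    """Write every row of the messages table to a CSV file and a JSON file."""
    cursor = conn.execute("SELECT * FROM messages ORDER BY date")
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()

    # CSV: header row followed by one line per message.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)

    # JSON: a list of objects keyed by column name.
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(
            [dict(zip(columns, row)) for row in rows],
            f,
            ensure_ascii=False,
            indent=2,
        )

    return len(rows)
```

Note that CSV stores every value as text, while JSON keeps numbers and nulls distinct, which is one reason the JSON file is the better input for programmatic processing.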
Features in Detail 🔍
Continuous Scraping
The continuous scraping feature ([C] option) lets you:
- Monitor channels in real time
- Automatically download new messages
- Download media as it’s posted
- Run indefinitely until interrupted (Ctrl+C)
- Maintain state between runs
Media Handling
The script can download:
- Photos
- Documents
- Other media types supported by Telegram
It also automatically retries failed downloads and skips existing files to avoid duplicates.
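The skip-and-retry behavior can be sketched as a small wrapper around whatever function performs the download. Everything here is an assumption made for illustration: the name `fetch_media`, the `download(path)` callback, the choice of `OSError` as the retryable error, and the retry count and delay are not taken from the script.

```python
import os
import time


def fetch_media(path, download, retries=3, delay=1.0):
    """Download a file unless it already exists, retrying on failure.

    download(path) is a caller-supplied function that writes the file; it
    stands in for the real Telegram download call.
    """
    if os.path.exists(path):
        return "skipped"  # already on disk, avoid a duplicate download

    for attempt in range(1, retries + 1):
        try:
            download(path)
            return "downloaded"
        except OSError:
            if attempt == retries:
                return "failed"  # candidate for later reprocessing
            time.sleep(delay * attempt)  # wait a little longer each time
```

Returning a status string instead of raising lets the caller record failed items and revisit them later, which matches the idea of reprocessing media for failed downloads.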
Error Handling 🛠️
The script includes:
- Automatic retry mechanism for failed media downloads
- State preservation in case of interruption
- Flood control compliance
- Error logging for failed operations
Limitations ⚠️
- Respects Telegram’s rate limits
- Can only access public channels or channels you're a member of
- Media download size limits apply as per Telegram’s restrictions
Contributing 🤝
Contributions are welcome! Please feel free to submit a Pull Request.
License 📄
This project is licensed under the MIT License; see the LICENSE file for details.
Disclaimer ⚖️
This tool is for educational purposes only. Make sure to:
- Respect Telegram’s Terms of Service
- Obtain necessary permissions before scraping
- Use responsibly and ethically
- Comply with data protection regulations