statuscafe_parser.py
Je Sian Keith Herman March 08, 2024 #Python #Scripting #CSV #Data ExtractionA Python script that parses an Atom RSS feed from status.cafe and exports the data to a CSV file
# =============================================================================
#
# Author: jskherman
# Date: 2024-03-08
# Description:
# This script parses an XML file from status.cafe in the Atom format, which
# typically represents an RSS feed. The script extracts relevant information
# from the entries in the feed, such as the title, author, published
# timestamp, content, ID, and link. It then processes this information and
# writes it to a CSV file.
#
# Specifically, the script performs the following tasks:
#
# 1. Prompts the user to enter the name of the XML file to parse.
# 2. Parses the provided XML file using the xml.etree.ElementTree module.
# 3. Extracts the following information from each entry in the feed:
# - Title
# - Author
# - Published timestamp (converted to a UNIX timestamp)
# - Content (with HTML entities decoded)
# - ID (extracted from the URL in the 'id' node)
# - Link (for the 'alternate' link)
# - Emoji (extracted from the title, assuming it's the second word)
# 4. Checks for duplicate IDs in the feed entries. If any duplicates are
# found, it raises a ValueError with the duplicate IDs and their
# corresponding status updates.
# 5. Writes the extracted information to a CSV file named 'output.csv',
# with the following columns:
# - ID
# - Timestamp
# - Author
# - Emoji
# - Status
# - Link
# 6. Prints a success message if the execution is successful.
# 7. If an exception occurs during execution, it prints the error message.
#
# Dependencies:
# - xml.etree.ElementTree: for parsing the XML file
# - csv: for writing the data to a CSV file
# - html: for decoding HTML entities in the content
# - re: for using regular expressions to extract the ID from the URL
# - datetime: for converting the timestamp string to a UNIX timestamp
#
# Usage:
# 1. Save this script to a file (e.g., feedparser.py).
# 2. Run the script using the Python interpreter: `python feedparser.py`
# 3. When prompted, enter the name of the XML file to parse (e.g., feed.xml).
# 4. The script will process the file and generate an 'output.csv' file in
# the same directory.
# 5. If any errors occur, the script will print the error message and provide
# helpful information.
#
# =============================================================================
"""
Extract the emoji from the title string.
Args:
title (str): The title string from which to extract the emoji.
Returns:
str: The extracted emoji, or an empty string if no emoji is found.
"""
# Split the title by whitespace
=
# Get the second element as the emoji
return
return
"""
Extract the status ID from the given URL.
Args:
id_url (str): The URL containing the status ID.
Returns:
str: The extracted status ID, or an empty string if no ID is found.
"""
# Extract the digits at the end of the URL
=
return
return
# Ask the user for the file name
=
# Parse the XML file
=
=
# Open a CSV file for writing
=
=
# Write the header row in the CSV file
# Initialize a set to store unique IDs
=
# Initialize a dictionary to store duplicate IDs and their corresponding status updates
=
# Iterate over the entries in the XML file
= .
= .
= .
# Convert the timestamp string to a UNIX timestamp
=
=
= .
=
=
=
=
=
# Check if the ID is already in the set
# If it's a duplicate, store it in the duplicate_ids dictionary
=
# If it's a unique ID, add it to the set and write the row to the CSV
# Check if there were any duplicate IDs
# Raise an error with the duplicate IDs and their corresponding status updates
=