HackerNews – Initializing Tables with runOnce.py

news, daily newspaper, press-1172463.jpg

In a project, we usually want to execute codes many times, but there are also some codes we only want to execute once. It is a good idea to utilize a file made specifically for running a program once and let’s call it in this case the runOnce.py file.

For this HackerNews project, we will initialize and populate tables within our runOnce.py file to create ops_tracker, ops_name, top_stories, and users. Information for each table below:

  1. ops_tracker: Shows the time stamps of each automated execution within the database. Contains primary_key, tracking_key, start_time, end_time, and the ops_key.
  2. ops_name: Shows the name and details about what each operation does classified by an ops_key. Contains: ops_key (primary_key of table), ops_name, ops_detail, effective_start_date, and effective_end_date.
  3. top_stories: Shows the top stories created and are categorized by the tracking_key. Contains primary_key, tracking_key, id, deleted, type, by, time, text, dead, parent, poll, and kids.
  4. users: show user information tied with primary_key and tracking_key. Contains primary_key, tracking_key, id, created, karma, about, and submitted

Alright, now that we know what we are going to initialize, let’s do it using Python!

First, let’s import our required libraries and connect to the database

import pandas as pd
import os as os
from sqlalchemy import (create_engine, Column, DateTime, Integer, MetaData, Table, Boolean,
                        VARCHAR)
from sqlalchemy.dialects.postgresql import TIMESTAMP
from dotenv import load_dotenv

load_dotenv('.env')

user = os.environ['login']
password = os.environ['password']
dbname = os.environ['dbname']
host = os.environ['host']
port = os.environ['port']

engine = create_engine(
    (
        f"postgresql://{user}:"
        f"{password}"
        f"@{host}:{port}/{dbname}"
    )
)

connection = engine.connect()

Now initialize our tables by declaring a variable, table name, and column information to each table.

metadata = MetaData()

ops_tracker = Table(
    'ops_tracker', metadata,
    Column('primary_key', Integer, primary_key=True, autoincrement=True),
    Column('tracking_key', Integer, nullable=False),
    Column('start_time', TIMESTAMP, nullable=False),
    Column('end_time', TIMESTAMP, nullable=False),
    Column('ops_key', Integer, nullable=False)
)

ops_name = Table(
    'ops_name', metadata,
    Column('ops_key', Integer, nullable=False, primary_key=True),
    Column('ops_name', VARCHAR, nullable=False),
    Column('ops_detail', VARCHAR, nullable=True),
    Column('effective_start_date', TIMESTAMP, nullable=False),
    Column('effective_end_date', TIMESTAMP, nullable=False)
)

top_stories = Table(
    'top_stories', metadata,
    Column('primary_key', Integer, primary_key=True, autoincrement=True),
    Column('tracking_key', Integer, nullable=False),
    Column('id', Integer, nullable=True),
    Column('deleted', Boolean, nullable=True),
    Column('type', VARCHAR, nullable=True),
    Column('by', VARCHAR, nullable=True),
    Column('time', Integer, nullable=True),
    Column('text', VARCHAR, nullable=True),
    Column('dead', Boolean, nullable=True),
    Column('parent', Integer, nullable=True),
    Column('poll', Integer, nullable=True),
    Column('kids', VARCHAR, nullable=True),
    Column('url', VARCHAR, nullable=True),
    Column('score', Integer, nullable=True),
    Column('title', VARCHAR, nullable=True),
    Column('parts', VARCHAR, nullable=True),
    Column('descendants', Integer, nullable=True),
    Column('extracted_datetime', TIMESTAMP, nullable=True)
)

users = Table(
    'users', metadata,
    Column('primary_key', Integer, primary_key=True, autoincrement=True),
    Column('tracking_key', Integer, nullable=False),
    Column('id', VARCHAR, nullable=False),
    Column('created', Integer, nullable=True),
    Column('karma', Integer, nullable=True),
    Column('about', VARCHAR, nullable=True),
    Column('submitted', VARCHAR, nullable=True),
    Column('extracted_datetime', TIMESTAMP, nullable=True)
)

metadata.create_all(engine)

With metadata.create_all(engine) , we have now created empty tables in our schema.

Since our tables are currently empty, we will first fill in our ‘ops_name’ table with information on certain operations we will use later on.

dataOpsName = pd.DataFrame(
    {
        'ops_key': [1,2,3],
        'ops_name': ['retrieve_top_stories', 'stories_update', 'users_update'],
        'ops_detail': ['Initial operations to get top news', 'Update top news', 'Update data related to users'],
        'effective_start_date': ['2023-12-03','2023-12-03','2023-12-03'],
        'effective_end_date': ['9999-12-31','9999-12-31','9999-12-31']
    }
)

dataOpsName.to_sql(
    name = 'ops_name',
    con=connection,
    if_exists='append',
    index=False
)

connection.commit()

Now, we have inserted rows into the ops_names table! This will be the key to what type of operations is used for other operations tables.