Idea in extracting the title of an article?

Discussion in 'App Development' started by blueraincap, Mar 14, 2024.

blueraincap

1,602
Posts
317
Likes

I have a bunch of academic papers on my computer that need organising.
I need to extract their titles, but have not found a reliable method yet.
Any ideas?
Usually the title has the largest font on the first page, so I used Python (with the pdfminer module) to pick it out, but it only works 50-60% of the time.

Code:

import sys
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTLine, LTChar

#this module contains one public method to parse the title of a typical PDF academic paper, given its filepath
#the parsing often fails because of difficulty in parsing PDF files

def getTitle(filepath):
    """main method to extract the title of a pdf file"""

    if not isinstance(filepath, str): #strip() on a non-string raises AttributeError, not TypeError
        print("filepath not a string")
        return "Unknown"

    filepath = filepath.strip()
    if not filepath.lower().endswith(".pdf"):
        sys.exit("filepath not ending with .pdf in " + filepath)
  
    try:
        data = extractFirstPageElements(filepath)
        texts = data[0]
        fonts = data[1]
    
        maxFontPos = extractTitlePos(fonts)
        uncleanedTitle = extractTitle(texts, maxFontPos)
        title = cleanTitle(uncleanedTitle)

    except Exception: #mostly due to IndexError as no texts are read in due to unreadable file or blank page, or some readTexts-fontSize mismatch
        title = "Unknown"

    if len(title) > 120: #ultra-long title likely error
        title = "Unknown"

    return title


def extractFirstPageElements(filepath):
    """helper method to extract text elements and their corresponding font sizes into lists"""

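    fonts = []
    elements = []
    for page in extract_pages(filepath):
        for element in page:
            if isinstance(element, LTTextContainer):
                font_size = None
                for text_line in element:
                    if not isinstance(text_line, LTTextContainer):
                        continue #skip anything that is not a line of characters
                    for character in text_line:
                        if isinstance(character, LTChar):
                            font_size = character.size
                            break #first character only, assuming others in text_line have the same size
                    if font_size is not None:
                        break #stop at the first line that yielded a size
                if font_size is not None: #skip elements with no characters so texts and fonts stay aligned
                    fonts.append(font_size)
                    elements.append(element.get_text())
        break #read first page only
    return [elements, fonts]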


def extractTitlePos(fonts):
    """helper method to extract the positions having the largest font size"""
    font_pos = []
    maxFont = max(fonts)
    for pos, size in enumerate(fonts):
        if size == maxFont:
            font_pos.append(pos)
    return font_pos


def extractTitle(elements, positions):
    """helper method to extract those elements having the largest font size, then return as a joint string"""
    title = []
    for i in positions:
        title.append(elements[i])
    return " ".join(title)


def cleanTitle(title):
    """helper method to clean the title by removing \n and illegal filename symbols"""

    englishArticles = ("a", "an", "the") #to remove if start word of title
            
    title = title.strip() #remove whitespaces
    title = title.replace("\n", " ") #remove any inline \n
    title = title.replace(":", " -") #replace invalid file : symbol
    title = title.replace("?", " ") #replace invalid file ? symbol
    title = title.replace("*", "") #replace invalid symbol
    title = title.replace("@", "") #replace invalid symbol
    title = title.replace("/", " ") #replace invalid symbol
    title = " ".join(title.split()) #collapse any run of whitespace into a single space
    title = title.title() #capitalize each word

    #remove starting article if one
    titleWords = title.split()
    if len(titleWords) > 1 and titleWords[0].lower() in englishArticles:
        title = " ".join(titleWords[1:]) #drop the starting article

    #perform word capitalization, some keywords should be in all lower-case or upper-case, regular words are letter-capitalized
    words = title.split() #individual words in the title, each letter-capitalized

    #var to hold words that should be all lower-case
    lowercaseWords = ("a", "an", "the", "at", "to", "from", "for", "using", "of", "among", "across", "during","what", "with", "and", "or", "between", "in", "on", "is", "are", "as", "there", "under", "toward", "towards", "through", "via", "by", "based", "vs", "versus", "its", "it", "their")

    #var to hold words that should be all upper-case
    uppercaseWords = ("us", "usa", "eu", "uk", "hk", "nyse", "ftse", "hkse", "pca", "etf", "etfs", "fx", "ipo", "hft", "spx", "vix", "vxx", "adr", "adrs")

    capAdjustedWords = [] #var to hold capitalization-appropriate words

    for word in words: #check each word one by one
        if word.lower() in lowercaseWords:
            capAdjustedWords.append(word.lower()) #convert to lower-case
        elif word.lower() in uppercaseWords:
            capAdjustedWords.append(word.upper()) #convert to upper-case
        else:
            capAdjustedWords.append(word) #leave capitalization as is
    if capAdjustedWords[0].islower():
        capAdjustedWords[0] = capAdjustedWords[0].capitalize() #capitalize in case first word is in lowercaseWords
    title = " ".join(capAdjustedWords)

    #often the final char is some special character so should be removed
    if title and not title[-1].isalnum():
        title = title[:-1].rstrip() #remove the last char and any space left before it

    return title


if __name__ == "__main__":
    filepath = input("Enter the file path (using /): ")
    print(getTitle(filepath))
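
One possible reason the largest-font rule misses: extractTitlePos compares sizes with ==, and the sizes pdfminer reports are floats, so two lines of the same title may differ by a tiny amount. A minimal sketch of a tolerant comparison follows. The name extractTitlePosTolerant, the tol default and the "first contiguous run only" rule are assumptions for illustration, not part of the code above.

```python
def extractTitlePosTolerant(fonts, tol=0.5):
    """Return positions of the first contiguous run of blocks whose font size
    is within `tol` points of the largest size on the page."""
    if not fonts:
        return []
    maxFont = max(fonts)
    positions = []
    for pos, size in enumerate(fonts):
        if maxFont - size <= tol:
            positions.append(pos)
        elif positions:
            break  # the run has ended; ignore later large-font blocks
    return positions


# hypothetical sizes: body text, two title lines, author line, a later heading
print(extractTitlePosTolerant([9.0, 17.93, 17.9299, 11.0, 17.93]))  # [1, 2]
```

Stopping at the end of the first run is a design choice: it keeps a large section heading further down the page from being glued onto the title.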

#1 Mar 14, 2024


murray t turtle likes this.

S2007S
- 26,644
  Posts
- 3,379
  Likes
Isn't the AI that's now part of our daily routine, and the talk of Wall Street, able to handle this with a few voice prompts?

#2 Mar 14, 2024


murray t turtle likes this.
Baron Administrator
- 7,074
  Posts
- 6,267
  Likes
Seems like the easiest method would be to go through and rename the filename of each paper to reflect the title of it. That way you can easily search and sort through the filenames (Titles) using your file browser.

Baron Robertson
Founder
EliteTrader.com

+1 (407) 230-9956
baron@elitetrader.com

#3 Mar 14, 2024

blueraincap
- 1,602
  Posts
- 317
  Likes
Baron said:
Seems like the easiest method would be to go through and rename the filename of each paper to reflect the title of it. That way you can easily search and sort through the filenames (Titles) using your file browser.

What the shit are you talking about? Where do the titles come from to begin with the renaming?

#4 Mar 14, 2024

Baron Administrator
- 7,074
  Posts
- 6,267
  Likes
The process I'm referring to is a manual process, not automated. You open up each paper, copy the title of it, and then rename the file by pasting the title as the new filename.

So for example, you might have a paper with a filename of 2022hft.pdf and when you open that file you see that the paper is titled "HFT Activity and Overview in 2022", so you would copy that and rename the file HFT_Activity_and_Overview_in_2022.pdf
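
Once a title is known, by hand or otherwise, the rename step itself can be scripted. A minimal sketch using only the standard library follows; the function names titleToFilename and renameToTitle are made up for illustration, and the set of stripped characters is an assumption based on what Windows rejects in filenames.

```python
import re
from pathlib import Path


def titleToFilename(title, ext=".pdf"):
    """Turn a paper title into a filename with underscores between words."""
    cleaned = re.sub(r'[\\/:*?"<>|]', " ", title)  # drop characters Windows rejects
    return "_".join(cleaned.split()) + ext


def renameToTitle(path, title):
    """Rename the file at `path` after its title, keeping it in the same folder."""
    source = Path(path)
    target = source.with_name(titleToFilename(title, source.suffix))
    if target == source:
        return source  # already named after its title
    if target.exists():
        raise FileExistsError(target)  # never overwrite another paper
    return source.rename(target)


print(titleToFilename("HFT Activity and Overview in 2022"))
# HFT_Activity_and_Overview_in_2022.pdf
```

Calling renameToTitle("2022hft.pdf", "HFT Activity and Overview in 2022") would then perform the rename described above.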

#5 Mar 14, 2024


murray t turtle likes this.
blueraincap
- 1,602
  Posts
- 317
  Likes
Baron said:
The process I'm referring to is a manual process, not automated. You open up each paper, copy the title of it, and then rename the file by pasting the title as the new filename.

So for example, you might have a paper with a filename of 2022hft.pdf and when you open that file you see that the paper is titled "HFT Activity and Overview in 2022", so you would copy that and rename the file HFT_Activity_and_Overview_in_2022.pdf

My question is how to automate the process. Any idiot can do it manually.

#6 Mar 14, 2024


d08 likes this.
BMK
- 1,128
  Posts
- 846
  Likes
This may be a good example of the current limits of artificial intelligence...

I don't think there is a reliable way to identify the title of a research paper without actually opening the file and using human reasoning.

There are variables such as a subtitle, and the title of the journal, which make this task very difficult for an algorithm.
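
The journal-name problem can be reduced, though not solved, by skipping large-font blocks that look like a banner. A rough sketch follows, working on the same kind of elements and fonts lists the opening post builds. The function name pickTitle, the BANNER_WORDS stoplist and the minWords threshold are all assumptions, and a subtitle set in a smaller font would still be lost.

```python
BANNER_WORDS = ("journal", "proceedings", "vol.", "issn", "doi", "arxiv")  # assumed stoplist


def pickTitle(elements, fonts, minWords=3):
    """Return the largest-font block that does not look like a journal banner."""
    # rank by font size (largest first), breaking ties by position on the page
    ranked = sorted(zip(fonts, range(len(fonts)), elements),
                    key=lambda item: (-item[0], item[1]))
    for size, pos, text in ranked:
        text = " ".join(text.split())  # flatten line breaks inside the block
        if len(text.split()) < minWords:
            continue  # too short to be a title
        if any(word in text.lower() for word in BANNER_WORDS):
            continue  # looks like a journal or repository banner
        return text
    return "Unknown"


# made-up first page: banner, title over two lines, author, abstract
elements = ["Journal of Hypothetical Finance\n",
            "Order Flow and\nLiquidity in 2022\n",
            "Jane Doe\n",
            "Abstract: this paper studies...\n"]
fonts = [20.0, 18.0, 12.0, 10.0]
print(pickTitle(elements, fonts))  # Order Flow and Liquidity in 2022
```

A stoplist like this will reject genuine titles that happen to contain one of the words, so it trades one kind of error for another.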

There are other contexts, for example... geez, I dunno, maybe electronic filing systems used by the courts, or maybe a system like EDGAR, where they may have strong rules that govern file names, and that would potentially make the task a lot easier.

#7 Mar 14, 2024

2rosy
- 3,208
  Posts
- 1,367
  Likes
https://pypi.org/project/pdftitle/

or open the pdf and see if there is metadata with the title in it
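
Checking the metadata can be done with the same pdfminer package the opening post already uses. A minimal sketch follows, assuming pdfminer.six is installed; the helper names decodePdfString and getMetadataTitle are made up, and the latin-1 fallback is only an approximation of the PDF string encoding.

```python
def decodePdfString(raw):
    """Decode a raw PDF metadata value (bytes or str) to text."""
    if isinstance(raw, str):
        return raw.strip()
    if not isinstance(raw, bytes):
        return ""  # missing or unexpected value
    if raw.startswith(b"\xfe\xff"):  # UTF-16 big-endian with byte order mark
        return raw[2:].decode("utf-16-be", errors="replace").strip()
    return raw.decode("latin-1").strip()  # approximation, see note above


def getMetadataTitle(filepath):
    """Return the Title entry of the PDF's info dictionary, or "" if absent."""
    # imported here so decodePdfString can be used without pdfminer installed
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    with open(filepath, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        for info in doc.info:  # a list of info dictionaries
            if "Title" in info:
                return decodePdfString(resolve1(info["Title"]))
    return ""
```

An empty return value means the file carries no title metadata, in which case a font-based method is still needed as a fallback.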

#8 Mar 14, 2024


blueraincap likes this.
blueraincap
- 1,602
  Posts
- 317
  Likes
2rosy said:
https://pypi.org/project/pdftitle/

or open the pdf and see if there is metadata with the title in it

Many pdf files do not have metadata

#9 Mar 14, 2024

BMK
- 1,128
  Posts
- 846
  Likes
2rosy said:
see if there is metadata with the title in it

That was my first thought, that maybe the metadata, or file properties, would contain the title. And PDF properties do indeed contain a field called title. But in many, many cases, the data in that field is completely unrelated to the title of the paper.

There is no standard across academia for how you name a file.
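
Because the title field is so often junk, it helps to sanity-check it before trusting it. A rough sketch follows. The function name isPlausibleTitle, the JUNK_PATTERNS list and the minWords threshold are assumptions, based on the kind of values authoring tools commonly leave behind (for example "Microsoft Word - draft.doc" or the source filename).

```python
import os
import re

# assumed patterns for values that are almost never a real paper title
JUNK_PATTERNS = (
    r"^microsoft word\b",
    r"^untitled\b",
    r"\.(pdf|docx?|tex|dvi|ps|indd)$",
)


def isPlausibleTitle(metaTitle, filepath="", minWords=3):
    """Guess whether a metadata title is a real title rather than tool residue."""
    text = " ".join(metaTitle.split())
    lowered = text.lower()
    if len(text.split()) < minWords:
        return False  # too short to be a paper title
    if any(re.search(pattern, lowered) for pattern in JUNK_PATTERNS):
        return False  # looks like an authoring-tool default or a filename
    stem = os.path.splitext(os.path.basename(filepath))[0].lower()
    if stem and lowered == stem:
        return False  # the field just repeats the file's own name
    return True


print(isPlausibleTitle("Microsoft Word - draft_v3.doc"))      # False
print(isPlausibleTitle("HFT Activity and Overview in 2022"))  # True
```

When the check fails, falling back to a font-based guess from the first page is one option.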

#10 Mar 14, 2024

