Using pdfminer to parse out title of a paper

Discussion in 'App Development' started by deanstreet, Jul 1, 2021.

  1. The title, author, date, etc. of a paper are obvious to the human eye, but I was wondering how to parse them out automatically for a folder of papers downloaded from [SSRN](https://www.ssrn.com/index.cfm/en/). Most papers don't carry proper metadata, so these fields have to be parsed from the title page.

    Most titles have the largest font on the title page, so I identify the text elements with the maximum font size and assume they make up the title. I tried this on a few files and it works fine, but not always.

    Author names don't seem to follow any particular pattern, so I don't know how to parse them. Any ideas?

    Code:
    import sys
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTChar
    
    def getTitle(filepath):
        """method to extract the title of a pdf file"""
     
        # calling strip() on a non-string raises AttributeError, so check the type up front
        if not isinstance(filepath, str):
            sys.exit("filepath not a string")
        filepath = filepath.strip()
       
        if not filepath.endswith(".pdf"):
            sys.exit("filepath not ending with .pdf")
         
       
        data = extractFirstPageElements(filepath)
        texts = data[0]
        fonts = data[1]
    
        if len(texts) == 0: #in case no text is read, either the pdf text is not readable or first page is blank
            title = "Unknown"
        else:
            maxFontPos = extractTitlePos(fonts)
            uncleanedTitle = extractTitle(texts, maxFontPos)
            title = cleanTitle(uncleanedTitle)
     
        return title
    
    
    def extractFirstPageElements(filepath):
        """helper method to extract text elements and their corresponding font sizes into lists"""
    
        fonts = []
        elements = []
        for page in extract_pages(filepath):
            for element in page:
                if isinstance(element, LTTextContainer):
                    # size of the first character in this element, or None if it has no characters
                    font_size = next(
                        (character.size
                         for text_line in element
                         for character in text_line
                         if isinstance(character, LTChar)),
                        None)
                    if font_size is None:
                        continue
                    fonts.append(font_size)
                    elements.append(element.get_text())
            break #only the first page is needed
        return [elements, fonts]
    
    
    def extractTitlePos(fonts):
        """helper method to extract the positions having the largest font size"""
        font_pos = []
        maxFont = max(fonts)
        for pos, size in enumerate(fonts):
            if size == maxFont:
                font_pos.append(pos)
        return font_pos
    
    
    def extractTitle(elements, positions):
        """helper method to extract those elements having the largest font size, then return as a joint string"""
        title = []
        for i in positions:
            title.append(elements[i])
        return "".join(title)
    
    
    def cleanTitle(title):
        """helper method to clean the title by removing newlines and illegal filename symbols"""
        title = title.strip() #remove leading and trailing whitespace
        title = title.replace("\n", " ") #replace any inline newline with a space
        title = title.replace(":", " -") #replace invalid filename : symbol
        title = title.replace("?", " ") #replace invalid filename ? symbol
        return title
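
One possible reason the largest-font heuristic "works fine but not always" is that `extractTitlePos` compares font sizes for exact equality and also picks up any later element that happens to share the maximum size. The following is only a sketch of a more forgiving variant, not a tested fix: `title_positions` and its `tol` parameter are hypothetical names, and it assumes the title is the first contiguous run of elements within `tol` points of the largest size.

```python
def title_positions(fonts, tol=0.1):
    """Return indices of the first contiguous run of elements whose font size
    is within `tol` points of the largest size in `fonts`."""
    if not fonts:
        return []
    largest = max(fonts)
    positions = []
    for pos, size in enumerate(fonts):
        if largest - size <= tol:
            positions.append(pos)
        elif positions:
            break  # the run of title-sized text has ended
    return positions
```

It takes the same `fonts` list as `extractTitlePos` and returns the same kind of position list, so it could be swapped in for comparison. A large logo or watermark on the first page would still fool it.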
     
  2. ph1l


    Here is a quick and dirty way to parse document titles with bash, pdftotext, and perl:
    Code:
    for f in SSRN*.pdf
    do
        echo
        echo "${f}"
        pdftotext -layout "${f}" - |            # convert pdf to text keeping original layout
        perl -n -e 'use warnings; use strict;
        our @t; # holds the title
        my $line = $_;
        if ( ($line =~ /^\f/) && (scalar(@t) > 0) )
        {
            # form feed after finding title
            print join(" ", @t), "\n"; exit(0);
        }
        $line =~ s/^\s+//; $line =~ s/\s+$//;   # remove leading and trailing white space
        if ( ($line eq "") && (scalar(@t) > 0) )
        {
            # blank line after finding title
            print join(" ", @t), "\n"; exit(0);
        }
        next if ($line eq "");   # skip blank lines before the title starts
        push (@t, $line);   # save next part of title
        '
    done
    
    Example run on Windows 10 with cygwin (would probably work on Linux too):
    SSRN-id1307643.pdf
    Financial Astrology: Mapping the Presidential Election Cycle in US Stock Markets

    SSRN-id1447443.pdf
    Exercises in Advanced Risk and Portfolio Management R (ARPM) with Solutions and Code, supporting the 6-day intensive course ARPM Bootcamp

    SSRN-id2140091.pdf
    Demystifying Time-Series Momentum Strategies: Volatility Estimators, Trading Rules and Pairwise Correlations∗

    SSRN-id264513.pdf
    Spectral Analysis of Economic Time Series Behaviour

    SSRN-id3184092.pdf
    Dynamic Alpha: A Spectral Decomposition of Investment Performance Across Time Horizons∗

    SSRN-id566882.pdf
    Technical Analysis in Financial Markets

    SSRN-id715301.pdf
    A Simplified Approach to Understanding the Kalman Filter Technique
     
  3. Can you briefly explain the logic?
     
  4. ph1l


    In pseudocode:
    Code:
    for each pdf file,
        Print the file name to make the output easier to follow.
        Convert a pdf file to text with pdftotext command keeping the layout of the document.
        Save lines from the text until a line starts with a form feed or just has white space.
        Print the saved lines
        Skip to next pdf file.
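
For anyone who would rather stay in Python, the steps above can be sketched roughly as follows. This is an illustrative port, not ph1l's code: `first_paragraph` and `pdf_title` are hypothetical names, `pdftotext` is assumed to be on the PATH and to emit UTF-8, and this version also skips blank lines that come before the first line of text.

```python
import subprocess


def first_paragraph(text):
    """Return the first block of non-blank lines, joined by single spaces.

    Collection stops at the first blank line or form feed (page break)
    that follows at least one collected line.
    """
    collected = []
    # split on "\n" only: str.splitlines() would also split on form feeds
    for raw in text.split("\n"):
        if raw.startswith("\f") and collected:
            break  # form feed after finding the title
        line = raw.strip()
        if not line:
            if collected:
                break  # blank line after finding the title
            continue  # blank line before the title, keep looking
        collected.append(line)
    return " ".join(collected)


def pdf_title(path):
    """Convert a PDF to text with pdftotext, keeping the layout, and return the first paragraph."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, encoding="utf-8", check=True,
    )
    return first_paragraph(result.stdout)
```

Calling `pdf_title` on each file in the folder would mirror the loop in the shell version. Like the original, it simply trusts that the first paragraph on the first page is the title.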
    
     
  5. Looks like you would enjoy this: https://app.box.com/s/k8wmgc9bmtx736r9r20ti72owwband2l
     
  6. ph1l
