
I am trying to get our department's green OA repo on my university's website properly indexed by Google Scholar.

The PDFs in our repo are for the most part legacy scans with no structure. This means there is little hope of Google being able to scrape usable metadata from the PDFs themselves, so in our case an HTML abstract that carries the metadata and links to the PDF is essential for giving Google Scholar the correct metadata.

Our repo is based on the Drupal 7 CMS: the author fills in the Google Scholar metadata, which is stored as a regular Drupal node (HTML), and then uploads the full-length PDF to the site. The PDF is stored in the server's file system.

The way Drupal works means that the URL of the HTML abstract becomes http://example.org/node/NNN, while the uploaded PDF will have a URL like http://example.org/sites/default/files/XXXXXXX.pdf. The two are then linked together through the citation_pdf_url meta tag in the HTML.
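To make the linkage concrete, here is a minimal sketch of the kind of meta tags the abstract page would carry. It is written in Python purely for illustration (the real site would emit these from Drupal), and the function name, the sample title, author, and date are all hypothetical. The tag names are the ones Google Scholar documents.

```python
from html import escape


def scholar_meta_tags(title, authors, publication_date, pdf_url):
    """Return Google Scholar <meta> tags for one abstract page.

    Hypothetical helper: the names and sample values are illustrative,
    not taken from the actual repository.
    """
    fields = [("citation_title", title)]
    # One citation_author tag per author, in order.
    fields += [("citation_author", author) for author in authors]
    fields.append(("citation_publication_date", publication_date))
    # The tag that ties the HTML abstract to the full-text PDF.
    fields.append(("citation_pdf_url", pdf_url))
    return "\n".join(
        '<meta name="{}" content="{}">'.format(name, escape(value, quote=True))
        for name, value in fields
    )


tags = scholar_meta_tags(
    title="An example title",          # placeholder value
    authors=["Doe, Jane"],             # placeholder value
    publication_date="2014/01/31",     # placeholder value
    pdf_url="http://example.org/sites/default/files/XXXXXXX.pdf",
)
print(tags)
```

The output would go in the head of http://example.org/node/NNN, so the abstract and the PDF end up under different paths on the same host.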

I was happy with this, until I spotted the following paragraph in section H of the Indexing Guidelines for Google Scholar:

The content of the [citation_pdf_url] tag is the absolute URL of the PDF file; for security reasons, it must refer to a file in the same subdirectory as the HTML abstract. (my emphasis)

Failure to link the alternate versions together could result in the incorrect indexing of the PDF files, because these files would be processed as separate documents without the information contained in the meta tags. (their emphasis)

If this means what I think it means, indexing of my repo will break, because node/ is indeed not the same "subdirectory" as sites/default/files/.
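For what it's worth, this is how I read the rule, sketched in Python. It is only my literal interpretation (same scheme and host, same parent directory of the URL path); I have no idea what check Google actually performs, and the function name is made up.

```python
import posixpath
from urllib.parse import urlsplit


def same_subdirectory(abstract_url, pdf_url):
    """Literal reading of the guideline: same host and same parent path.

    Assumption: this is one interpretation of "same subdirectory",
    not Google Scholar's documented or actual test.
    """
    a, p = urlsplit(abstract_url), urlsplit(pdf_url)
    same_host = (a.scheme, a.netloc) == (p.scheme, p.netloc)
    # Compare the directory part of each URL path.
    return same_host and posixpath.dirname(a.path) == posixpath.dirname(p.path)


pdf = "http://example.org/sites/default/files/XXXXXXX.pdf"

# Current Drupal layout: abstract under /node, PDF under /sites/default/files.
current = same_subdirectory("http://example.org/node/NNN", pdf)

# Hypothetical layout with a static HTML abstract saved next to the PDF.
static = same_subdirectory("http://example.org/sites/default/files/NNN.html", pdf)

print(current, static)
```

Under this reading the current layout fails the test and only the static-file layout passes.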

However, who keeps HTML content in subdirectories in 2014? The current web design paradigm is to separate content (such as text rendered with HTML) from assets (such as PDF files), and only the latter are actually stored in the file system. To me, this requirement is outdated and will probably also break almost every repo that is built on some sort of CMS.

I am prepared to bite the bullet and write the code needed to spew out static HTML files that can be saved in the sites/default/files/ subdirectory along with the PDFs, if that is what it really takes to make Google Scholar index my repo.

However, before embarking upon this task, I wanted to hear from others who run repos that provide metadata for Google Scholar whether this "same subdirectory" rule is absolutely required if you want the HTML abstract and the full-length PDF to be linked by means of the citation_pdf_url tag.

Will having the HTML and the PDF in different subdirectories really lead to incorrect indexing, or no indexing, of the PDF files?

1 Answer

Google Scholar is pretty smart about these things. On all of the websites I'm involved with, we keep the PDFs in a separate directory from the page that links to them, and Scholar finds them just fine. Of course, we also don't use the citation_pdf_url tag, so maybe we're fine because we're low-tech and Scholar does the inference for us...

I recommend doing it the easy way, and trusting Scholar to be smart. If it doesn't do the right thing, then you can always go back and drop the tag and wait for a refresh. And then if that doesn't work, you can do it the hard way.