
Recently, I added a new paper to my personal website, and it appeared on Google Scholar a couple of days later. On my website, all I did was write the name of the paper, together with the authors and the name of the conference, and provide a link to the PDF. This information alone somehow told Google Scholar that the text I added was actually a new paper. There is no other information about this paper on the web, so I know that Google used only my website to update Google Scholar.

So what I am wondering is: how does Google know what is a paper and what is just arbitrary text on my website? For example, if I had written only the name of the paper, without the authors, the conference, or the PDF, would it still have been detected?

On my website, the paper is listed on a webpage called "Publications", in a list with a load of other papers, but this is quite specific to the design of my own website. I'm wondering whether it has something to do with the PDF which I provided a link to. Perhaps it inspected the PDF and decided it was a paper, and if I hadn't added the PDF, it would not detect it as a paper. But then again, the HTML formatting does not necessarily indicate which text the PDF is actually associated with, even if it is obvious to a human on inspection of the webpage. Or perhaps Google Scholar just has some hand-engineered search which looks for instances of HTML where there is the name of a known conference, known authors, and a PDF nearby.

1 Answer

(Warning - gross oversimplifications incoming - if somebody doing research in Information Retrieval wants to add technical detail, be my guest!)

Fundamentally, Google finds all resources (HTML pages, images, as well as papers) on the web in the same way: periodically revisiting each and every resource it knows about, (re-)indexing it, and, for HTML content, following all links to other resources (rinse and repeat). Your web page is likely linked from your department web site, which Google definitely knows about, hence your web page is also in Google's database. Your web page links to your paper, hence Google will also know about your paper the next time the crawler checks your page. How long this will take is undefined, but Google has a lot of crawlers and is pretty smart about when to re-check certain types of pages, so it typically does not take very long.
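To make the "follow all links, rinse and repeat" idea concrete, here is a minimal sketch of breadth-first link discovery. It is only an illustration, not Google's crawler: the `TOY_WEB` dictionary, the `example.edu` URLs, and the `crawl` function are all made up for this example, and a real crawler would fetch pages over HTTP, respect robots.txt, and schedule revisits.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the links found on that resource.
TOY_WEB = {
    "https://dept.example.edu/": ["https://dept.example.edu/~me/"],
    "https://dept.example.edu/~me/": [
        "https://dept.example.edu/~me/publications.html",
    ],
    "https://dept.example.edu/~me/publications.html": [
        "https://dept.example.edu/~me/paper.pdf",
    ],
    "https://dept.example.edu/~me/paper.pdf": [],  # a PDF has no links to follow here
}


def crawl(seed_urls, web):
    """Breadth-first discovery: visit each known resource once, then follow its links."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    discovered = []
    while queue:
        url = queue.popleft()
        discovered.append(url)  # stand-in for "(re-)index this resource"
        for link in web.get(url, []):
            if link not in seen:  # only enqueue resources we have not met yet
                seen.add(link)
                queue.append(link)
    return discovered


# Starting from the department site, the crawler eventually reaches the PDF.
order = crawl(["https://dept.example.edu/"], TOY_WEB)
```

Starting from a single page the crawler already knows (the department site), the PDF is reached after three hops, which is the chain described above.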

Now, Google has specific heuristics in place to treat different types of resources differently. For instance, if an HTML page is added to the database, keywords will be extracted, links will be followed, etc., while an image will lead to completely different actions. Scientific papers are no different in that sense: as soon as Google finds a PDF or Word file that "looks like" a scientific paper to the automated process, Google will generate paper metadata (title, authors, venue, keywords, ...) by parsing the document text as well as it can and add it to its special Google Scholar database, and this is when the paper appears in your profile.
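As a rough illustration of what a "looks like a paper" heuristic could be, here is a deliberately naive sketch. The actual rules Google uses are not something I know; the cues below (an abstract, a references section, title on the first line, authors on the second) and the function names `looks_like_paper` and `guess_metadata` are assumptions chosen purely for the example.

```python
import re


def looks_like_paper(text):
    """Naive check (assumed cues): mentions an abstract and has a references heading."""
    lowered = text.lower()
    has_abstract = "abstract" in lowered
    # A line consisting only of "references" or "bibliography".
    has_references = (
        re.search(r"^\s*(references|bibliography)\s*$", lowered, re.MULTILINE)
        is not None
    )
    return has_abstract and has_references


def guess_metadata(text):
    """Naive guess (assumed layout): first non-empty line is the title, second lists authors."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    title = lines[0] if lines else None
    authors = [a.strip() for a in lines[1].split(",")] if len(lines) > 1 else []
    return {"title": title, "authors": authors}


# Made-up text standing in for what was extracted from a PDF.
sample = (
    "A Study of Widgets\n"
    "Alice Smith, Bob Jones\n"
    "Abstract\n"
    "We study widgets.\n"
    "References\n"
    "[1] Some earlier work.\n"
)

is_paper = looks_like_paper(sample)
metadata = guess_metadata(sample)
```

A real system would work on layout information (font sizes, positions) rather than plain lines of text, which is part of why extraction is only done "as well as it can".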

Google's own website goes into quite some detail on this process. It also has instructions for authors looking to get their papers indexed by Scholar.
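If you want to help the parser rather than rely on heuristics, those instructions (as far as I recall them, so please check the current version) describe bibliographic `<meta>` tags such as `citation_title`, `citation_author`, `citation_publication_date`, and `citation_pdf_url` that can be placed on a paper's abstract page. A small sketch that generates such tags follows; the helper name `scholar_meta_tags` and the sample values are made up for illustration.

```python
from html import escape


def scholar_meta_tags(title, authors, publication_date, pdf_url, conference=None):
    """Build <meta> tags for one paper (tag names as recalled from Scholar's guidelines)."""
    tags = [("citation_title", title)]
    # One citation_author tag per author, in author order.
    tags += [("citation_author", author) for author in authors]
    tags.append(("citation_publication_date", publication_date))
    if conference:
        tags.append(("citation_conference_title", conference))
    tags.append(("citation_pdf_url", pdf_url))
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    )


# Hypothetical paper details, for illustration only.
html_head = scholar_meta_tags(
    title="A Study of Widgets & Gadgets",
    authors=["Alice Smith", "Bob Jones"],
    publication_date="2015/06/01",
    pdf_url="https://dept.example.edu/~me/paper.pdf",
)
```

The values are HTML-escaped so that characters such as `&` or quotes in a title do not break the markup.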