Joshua R. Bruce

Collecting Dept. of Energy-Supported Patents (with R)

05 Apr 2017

This short guide illustrates how to collect information on patents supported by the U.S. Department of Energy through government-awarded grants and contracts. The data is available from a freely accessible DoE API. I’m using R to download and manipulate this data.

To begin, you'll need to load the R packages XML and xml2.
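Assuming both packages are already installed from CRAN, that is simply:

```r
library(xml2)  # read_xml(): downloads and reads raw XML
library(XML)   # xmlParse(), xmlToList(): restructure the parsed XML
```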


These packages make it simple to download and reformat the raw XML files from the DoE API. To get a sense of what this data looks like, we can download and inspect a single record. To the base text of the API call we append nrows=1&page=0, telling the API we only want the first result of however many pages are available. To make sense of the results, we will also parse the XML…
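Sketching that first call (the real endpoint is elided above, so `base_url` below is a placeholder rather than the actual DoE address):

```r
# Placeholder only; substitute the real DoE API base URL here
base_url <- "https://example.gov/doepatents?"
call_url <- paste0(base_url, "nrows=1&page=0")

# The download-and-parse pipeline would then be (not run here):
# result <- xmlParse(as.character(read_xml(call_url)))
```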


… which returns the following:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="" xmlns:dc="" xmlns:dcq="">
  <records count="37698" morepages="true" start="1" end="1">
    <record rownumber="1">
      <dc:title>Silicon carbide whisker reinforced ceramic composites and method for making same</dc:title>
      <dc:creator>Wei, G.C.</dc:creator>
      <dc:description>The present invention is directed to the fabrication of ceramic composites which possess improved mechanical properties especially increased fracture toughness. In the formation of these ceramic composites, the single crystal SiC whiskers are mixed with fine ceramic powders of a ceramic material such as Al{sub 2}O{sub 3}, mullite, or B{sub 4}C. The mixtures which contain a homogeneous dispersion of the SiC whiskers are hot pressed at pressures in a range of about 28 to 70 MPa and temperatures in the range of about 1,600 to 1,950 C with pressing times varying from about 0.75 to 2.5 hours. The resulting ceramic composites show an increase in fracture toughness which represents as much as a two-fold increase over that of the matrix material.</dc:description>
      <dcq:publisherAvailability>Patent and Trademark Office, Box 9, Washington, DC 20232 (United States)</dcq:publisherAvailability>
      <dcq:publisherResearch>Union Carbide Corporation</dcq:publisherResearch>
      <dcq:publisherCountry>United States</dcq:publisherCountry>
      <dc:relation>Other Information: DN: Reissue of US Pat. No. 4,543,345, which was issued Sep. 24, 1985; PBD: 24 Jan 1989</dc:relation>
      <dc:format>Medium: X; Size: [10] p.</dc:format>
      <dc:identifier>OSTI ID: 27688, Legacy ID: OSTI ID: 27688</dc:identifier>
      <dc:identifierReport>US RE 32,843/E/</dc:identifierReport>
      <dc:identifierOther>Other: PAN: US patent application 6-847,961</dc:identifierOther>
      <dc:rights>Patent Assignee: Martin Marietta Energy Systems, Inc., Oak Ridge, TN (United States)</dc:rights>
      <dcq:identifier-purl type=""/>
      ...
    </record>
  </records>
</rdf:RDF>

We can thus see what the variables are in this dataset; there are 29 in total. The value we're most interested in is the identifierReport field, which contains the patent number. I find the results easier to explore after converting the XML to a list, which we can do with the xmlToList() function.

result_as_list <- xmlToList(xmlParse(as.character(read_xml(''))))

Once converted to a list, we can see the number of variables in the first item as follows: length(result_as_list$records[[1]]).
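The list structure can be explored offline with a trimmed-down stand-in for the record above (two fields instead of 29):

```r
library(XML)

# Miniature single-record response using two of the fields shown earlier
sample_xml <- paste0(
  '<records count="37698"><record rownumber="1">',
  '<title>Silicon carbide whisker reinforced ceramic composites</title>',
  '<identifierReport>US RE 32,843/E/</identifierReport>',
  '</record></records>')

as_list <- xmlToList(xmlParse(sample_xml))
length(as_list$record)            # the fields, plus a final .attrs entry
as_list$record$identifierReport   # "US RE 32,843/E/"
```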

In addition to this record's information, the results also tell us the total number of records, in the third line of the output: records count="37698". The entire DoE dataset is 37,698 records (last checked and updated August 22, 2017), and it grows continually as the DoE identifies older patents or new patents are issued.
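That count attribute can also be read programmatically rather than by eye; a small self-contained example using a stub of the root element:

```r
library(XML)

# Stub of the response's root element, carrying the same attributes
doc <- xmlParse('<records count="37698" morepages="true" start="1" end="1"/>')
total_records <- as.integer(xmlAttrs(xmlRoot(doc))["count"])
total_records  # 37698
```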

To collect the whole dataset, I begin by constructing a dataframe to hold all of the records.

energy_records <- data.frame(matrix(nrow = 37698, ncol = 29))

We can then add the variable names from our list to the energy_records dataframe, as follows.

# Add variable names to data frame
for(i in 1:ncol(energy_records)){
  colnames(energy_records)[i] <- names(result_as_list$records[[1]])[i]
}
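Since names() already returns the full vector, the same renaming can be done in a single vectorized assignment; a toy example with two hypothetical stand-in fields:

```r
# Stand-in for result_as_list$records[[1]] with made-up field names
first_record <- list(title = "example", creator = "example")

energy_records_demo <- data.frame(matrix(nrow = 2, ncol = 2))
colnames(energy_records_demo) <- names(first_record)
colnames(energy_records_demo)  # "title" "creator"
```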

Having created a dataframe to store the whole DoE database (as it exists at the time of data collection), a loop can be used to iteratively call the DoE API, transform the XML results, and place the relevant information in the corresponding columns of the energy_records dataframe. I have broken apart the XML processing steps to better understand any errors that emerge while the loop is running, but this isn’t necessary. Note that I have changed the number of records returned per page to 3000, the maximum allowed by the API. Given the 37,000+ records, we need to run the call 13 times to get all results, leading to for(k in 0:12) in the first line of code.
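The page-count arithmetic is simple enough to check directly:

```r
total_records <- 37698   # from the count attribute above
page_size <- 3000        # maximum rows per page allowed by the API
n_pages <- ceiling(total_records / page_size)
n_pages  # 13, covering pages 0 through 12
```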

# Loop to get all DoE records 
for(k in 0:12){
  # Base API URL elided; append nrows=3000&page=k to it
  step1 <- read_xml(paste0('',k))
  step2 <- xmlParse(as.character(step1))
  step3 <- xmlToList(step2)
  for(i in 1:length(step3$records)){
    for(j in 1:length(step3$records[[1]])){
      if(k == 0){
        try( energy_records[i,j] <- step3$records[[i]][[j]][1],
             silent = TRUE )
      }
      if(k > 0){
        # Offset by the 3,000 rows filled by each earlier page
        try( energy_records[(i + (3000 * k)), j] <- step3$records[[i]][[j]][1],
             silent = TRUE )
      }
    }
    # I print the page and row number as a status marker, but not necessary 
    print(paste0(k, ': ', i))
  }
}
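Once the dataframe is filled, the patent numbers can be pulled out of the identifierReport strings. The regex below is an assumption based on the "US RE 32,843/E/" format in the sample record above, not a documented rule:

```r
# Example identifierReport values in the format seen above; the second
# value is hypothetical, added only to illustrate the pattern
reports <- c("US RE 32,843/E/", "US 4,543,345/A/")

# Drop the leading "US " prefix, then everything from the first slash on
patent_numbers <- sub("/.*$", "", sub("^US\\s+", "", reports))
patent_numbers  # "RE 32,843" "4,543,345"
```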

End Note

In conclusion, this script allows us to download the entire body of the DOepatents database and structure it into a single dataframe. With this data in hand, it's possible to link these records with numerous other datasets, such as USPTO records made available through …