Three-Level Hierarchy (was New version, part 8)

I notice that the Crapwatch uses a 2-level hierarchy, rather than a 3-level hierarchy as above. The 3-level hierarchy makes it clear if something is a typo, or if something is a direct match to something in WP:CRAPWATCH/SETUP.

For instance

Rank Target/Group Entries (Citations, Articles) Total Citations Distinct Articles Citations/article

148 Blaze Media
[WP:RSP § Generally unreliable]
WP:RSP#Blaze Media
11 7 1.571

would be much better understood as

Rank Target/Group Entries (Citations, Articles) Total Citations Distinct Articles Citations/article

148 Blaze Media
[WP:RSP § Generally unreliable]
WP:RSP#Blaze Media
11 7 1.571

since Blaze Magazine is a typo/variant of The Blaze (magazine)

Likewise with Hindawi, you have

Rank Target/Group Entries (Citations, Articles) Total Citations Distinct Articles Citations/article

4 Hindawi Publishing Corporation
[Beall's publisher list*]
Originally listed on Beall's list, but later removed as a 'borderline case'



2478 2097 1.182

It would be a lot clearer (to humans) why Int J Inflam was listed if it appeared under

Rank Target/Group Entries (Citations, Articles) Total Citations Distinct Articles Citations/article

4 Hindawi Publishing Corporation
[Beall's publisher list*]
Originally listed on Beall's list, but later removed as a 'borderline case'



2478 2097 1.182

Int J Inflam would still be only counted once in the statistics, even if it was listed twice. Headbomb {t · c · p · b} 09:50, 18 March 2019 (UTC)
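A rough sketch of the "counted once even if listed twice" idea, in Python. This is not JL-Bot's actual code (which isn't public); the record layout, the function names, the group names and the numbers are all made up for illustration. Only "Hindawi Publishing Corporation" and "Int J Inflam" come from the example above.

```python
from collections import defaultdict


def build_hierarchy(records):
    """Build target -> group -> entry from (target, group, entry, citations, articles) tuples.

    The tuple layout is an assumption made for this sketch.
    """
    tree = defaultdict(lambda: defaultdict(dict))
    for target, group, entry, citations, articles in records:
        tree[target][group][entry] = (citations, frozenset(articles))
    return tree


def target_totals(tree, target):
    """Return (total citations, distinct articles) for a target.

    An entry listed under several groups of the same target is counted once.
    """
    seen = {}
    for group in tree[target].values():
        for entry, stats in group.items():
            seen[entry] = stats  # a repeated entry overwrites itself, so it is kept once
    total_citations = sum(citations for citations, _ in seen.values())
    distinct_articles = set().union(*(articles for _, articles in seen.values()))
    return total_citations, len(distinct_articles)


# Illustrative data only: group names and counts are hypothetical.
records = [
    ("Hindawi Publishing Corporation", "Hypothetical group A", "Int J Inflam", 3, {"A", "B"}),
    ("Hindawi Publishing Corporation", "Hypothetical group B", "Int J Inflam", 3, {"A", "B"}),
    ("Hindawi Publishing Corporation", "Hypothetical group A", "Hypothetical entry", 2, {"B", "C"}),
]
tree = build_hierarchy(records)
print(target_totals(tree, "Hindawi Publishing Corporation"))  # (5, 3), not (8, 3)
```

Keying the de-duplication on the entry text keeps the display free to repeat an entry under every group where it helps a human reader, without inflating the totals.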

Still plugging away at this. Every time I think I'm close, I stumble across another special case. -- JLaTondre (talk) 17:46, 31 March 2019 (UTC)

Source code?

I just realized that I don't think I remember you uploading the JL-Bot code publicly? It's a fairly advanced piece of software now, and I'm starting to get worried about the bus factor here. Would you be willing to put the code up somewhere (possibly in a {{infobox bot}} on the bot's userpage)? Headbomb {t · c · p · b} 21:18, 12 November 2019 (UTC)

Exclude bluelinks/redlinks with JCW-patterns

It would be useful if we could exclude bluelinks/redlinks from matching with {{JCW-pattern}}. For example,

  • {{JCW-pattern|Online|*Online*|!Nonlinear!|exclude=bluelinks}}

would only match redlinks. This would be useful in the case of something like

which would exclude the first four entries, but not the last one. Conversely,

  • {{JCW-pattern|Online|*Online*|!Nonlinear!|exclude=redlinks}}

would only match bluelinks, and in this case, keep the first four entries, but exclude the last one. Headbomb {t · c · p · b} 12:50, 19 November 2019 (UTC)
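A minimal sketch of how such an |exclude= option might be applied, in Python. It assumes the pattern matching has already produced a list of matched titles and that the bot has a set of titles that exist as pages; the function name, its arguments and the sample titles are hypothetical, and the pattern syntax itself is not modelled here.

```python
def apply_exclude(matches, existing_titles, exclude=None):
    """Filter pattern matches by link colour.

    matches         -- titles already matched by a pattern (assumed input)
    existing_titles -- set of titles that exist as pages, i.e. bluelinks (assumed input)
    exclude         -- None, "bluelinks" or "redlinks"
    """
    if exclude is None:
        return list(matches)
    if exclude == "bluelinks":
        # Drop titles that exist, so only redlinks remain.
        return [title for title in matches if title not in existing_titles]
    if exclude == "redlinks":
        # Drop titles that do not exist, so only bluelinks remain.
        return [title for title in matches if title in existing_titles]
    raise ValueError(f"unknown exclude value: {exclude!r}")


# Hypothetical titles, for illustration only.
matches = ["Alpha Online", "Beta Online", "Gamma Online"]
existing = {"Alpha Online", "Beta Online"}
print(apply_exclude(matches, existing, exclude="bluelinks"))  # ['Gamma Online']
print(apply_exclude(matches, existing, exclude="redlinks"))   # ['Alpha Online', 'Beta Online']
```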

Bot removals?

What's this new section? What is its purpose / How does it work? Headbomb {t · c · p · b} 10:45, 24 November 2019 (UTC)

LOL, and here I was thinking my edit summaries were pretty clear. ;-) I saw the bot removed a couple entries it shouldn't have so I restored them all until I can figure out what went wrong. Since you had done multiple edits in between, I couldn't simply revert the bot. It was easier to copy them all to their own section. Also makes it easier on me to debug. Once done, I'll move the ones that should stay back to their proper places. -- JLaTondre (talk) 13:38, 24 November 2019 (UTC)
Didn't even look at the edit summaries/contribution history. I was just editing the page as usual and noticed it and was like... WTF? This is new/I don't recall the bot doing this before. I didn't know it was a 'manual' temporary thing. Headbomb {t · c · p · b} 13:59, 24 November 2019 (UTC)

DOI inline merge into main grouping when possible

In WP:JCW/Publisher5#Mary Ann Liebert you have

and then later

The second entry comes from a {{doi-inline}} template, and isn't properly merged into the main grouping. Headbomb {t · c · p · b} 12:15, 18 December 2019 (UTC)

Partially addressed. It will no longer produce the duplicate listing (now checks for 'TITLE (journal)' and 'TITLE (magazine)' as well as 'TITLE' matches). However, in looking at this, I realized that it's not properly updating the article counts in these cases. I will work on that. -- JLaTondre (talk) 02:51, 27 December 2019 (UTC)
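For anyone following along, the check described above could look roughly like this in Python. This is a sketch under assumptions, not the bot's real code: the groups dictionary, the function names and the sample titles are invented for illustration. It also folds the merged entry's articles into the group, which is the count update mentioned as still to do.

```python
# Suffixes tried in order: 'TITLE', 'TITLE (journal)', 'TITLE (magazine)'.
DISAMBIGUATORS = ("", " (journal)", " (magazine)")


def find_group(title, groups):
    """Return the key of an existing group matching the title, or None."""
    for suffix in DISAMBIGUATORS:
        candidate = title + suffix
        if candidate in groups:
            return candidate
    return None


def merge_entry(title, citations, articles, groups):
    """Merge an entry into its main grouping, creating a new group only if none matches."""
    key = find_group(title, groups)
    if key is None:
        key = title
        groups[key] = {"citations": 0, "articles": set()}
    groups[key]["citations"] += citations
    groups[key]["articles"] |= set(articles)  # a set keeps the article count distinct
    return key


# Hypothetical data, for illustration only.
groups = {"Example Title (journal)": {"citations": 10, "articles": {"A", "B"}}}
print(merge_entry("Example Title", 2, {"B", "C"}, groups))  # Example Title (journal)
print(groups["Example Title (journal)"]["citations"])       # 12
print(len(groups["Example Title (journal)"]["articles"]))   # 3
```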

WP:JCW/DOI nightly runs

It would be a good idea to do runs if Category:Redirects from DOI prefixes has new/different members in it. I don't believe anything would change except for |registrant= in the compilation, so maybe a separate subroutine to just sync |registrant= with the category would be enough. Headbomb {t · c · p · b} 15:44, 10 January 2020 (UTC)

The DOI processing is pretty quick. For now, I will have it run if there are any other updates. I can add in the category check in a while. -- JLaTondre (talk) 22:04, 10 January 2020 (UTC)
Didn't run alongside the other updates last night. Still to be implemented, or a bug? Headbomb {t · c · p · b} 17:17, 12 January 2020 (UTC)
Manually running it. Should run with future ones. -- JLaTondre (talk) 01:35, 13 January 2020 (UTC)

@JLaTondre: I think the bot choked last night. Headbomb {t · c · p · b} 11:49, 18 January 2020 (UTC)

Server had an internet outage last night. It will run tonight. -- JLaTondre (talk) 21:29, 18 January 2020 (UTC)
@JLaTondre: Did the bot crash last night? It only edited [1], and I know for a fact that there were some changes in DOIs and exclusions. Headbomb {t · c · p · b} 19:49, 2 February 2020 (UTC)
Issue resolved. Should run tonight. -- JLaTondre (talk) 23:19, 2 February 2020 (UTC)
Still a nope. I wonder if it's because I'm editing the config pages midrun. That hasn't been an issue before though. Headbomb {t · c · p · b} 11:13, 3 February 2020 (UTC)
No, it was due to a typo on my part. I uploaded last night's results. It's now running the 20200201 dump. -- JLaTondre (talk) 22:20, 3 February 2020 (UTC)

Also User:JL-Bot/DOI could be updated with every dump (with the new-template based format). Headbomb {t · c · p · b} 09:48, 16 February 2020 (UTC)

Yes, that was originally a one-off. I'll change it to update with new pages. -- JLaTondre (talk) 00:55, 18 February 2020 (UTC)
It's mostly to provide a semi-monthly reset because I'm changing various patterns to see if it matches something that already exists on Wikipedia. And also to get new registrants. Once per dump is all that's needed here. Headbomb {t · c · p · b} 14:37, 25 February 2020 (UTC)
Ah, you changed topics. I was thinking this was related to WP:JCW/DOI & was a listing of its subpages. The CrossRef retrieval takes over 10 hours. That is a pretty extensive use of their resources. I've kicked it off, but let's see how much delta there is in the results before we query them monthly. Tomorrow, I'll run the upload to the User:JL-Bot/DOI pages. I've changed the output format over to the templates you created. By the way, I've been working on the three-level hierarchy as well as some performance improvements. I'm hoping to have that wrapped up in the next couple of weeks, but my schedule is constrained at the moment and it requires some significant changes (I needed to change the data structures in order to have the information required for the hierarchy at the point it is generated). I'll be spending quite a bit of time validating the output. -- JLaTondre (talk) 23:59, 25 February 2020 (UTC)
Cool beans! If the delta is small, what could be done is something like a base-reset (no queries to CrossRef), with a full refresh once per month/three months/six months/year/whatever. Headbomb {t · c · p · b} 01:25, 26 February 2020 (UTC)
The results are up. I fixed an issue that caused the last page (10.37000) not to be saved last time. Excluding that, there are still a significant number of differences between the two CrossRef results. Mostly minor changes in the formats of names, but some major changes as well as new listings. I will post a user friendly comparison in a bit. -- JLaTondre (talk) 22:21, 26 February 2020 (UTC)
See User:JL-Bot/DOI/Deltas. -- JLaTondre (talk) 00:03, 27 February 2020 (UTC)

@JLaTondre: very useful. I've removed the 37000s to get a more representative sense of what a typical delta would be. Whatever frequency we settle on for the JL-Bot/DOI updates, uploading a delta automatically would be very useful. Headbomb {t · c · p · b} 00:17, 27 February 2020 (UTC)

The 37000s were a valid delta. While they didn't get uploaded last time, I had collected the data. -- JLaTondre (talk) 01:04, 27 February 2020 (UTC)
Ah I see. Well I suppose it makes sense, if DOI prefixes get assigned in roughly sequential order. Headbomb {t · c · p · b} 01:07, 27 February 2020 (UTC)
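For reference, a delta between two CrossRef retrievals could be computed along these lines. This is only a Python sketch, assuming each retrieval has been reduced to a dictionary mapping DOI prefix to registrant name; that layout, the function name and the sample registrants (other than Pion Ltd for 10.1068) are assumptions, and it says nothing about how User:JL-Bot/DOI/Deltas was actually produced.

```python
def registrant_delta(old, new):
    """Compare two {prefix: registrant} snapshots.

    Returns (added, removed, changed), where changed maps a prefix to
    an (old registrant, new registrant) pair.
    """
    added = {prefix: new[prefix] for prefix in new.keys() - old.keys()}
    removed = {prefix: old[prefix] for prefix in old.keys() - new.keys()}
    changed = {
        prefix: (old[prefix], new[prefix])
        for prefix in old.keys() & new.keys()
        if old[prefix] != new[prefix]
    }
    return added, removed, changed


# Illustrative snapshots; every registrant except Pion Ltd is hypothetical.
old = {"10.1068": "Pion Ltd", "10.9999": "Old Name"}
new = {"10.1068": "Pion Ltd", "10.9999": "New Name", "10.37000": "Hypothetical Registrant"}
added, removed, changed = registrant_delta(old, new)
print(added)    # {'10.37000': 'Hypothetical Registrant'}
print(removed)  # {}
print(changed)  # {'10.9999': ('Old Name', 'New Name')}
```

Keeping "changed" separate from "added" makes it easier to see how much of a delta is minor name reformatting versus genuinely new prefixes, which is the question behind picking an update frequency.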

Automatic DOI-based subscriptions for WP:JCW/PUB and WP:CITEWATCH

Now that we have a substantial amount of DOIs, it would be good if the bot automatically 'selected' publishers and journals based on Category:Redirects from DOI prefixes.

For example 10.1068 has this

#REDIRECT[[SAGE Publishing]]
{{R from DOI prefix|registrant=Pion Ltd}}

For SAGE Publishing, this would basically be every redirect in Category:Redirects from DOI prefixes that points to SAGE Publishing (with each |registrant= found in those redirects listed as |imprint#=) or that has |registrant=SAGE Publishing (in this case nothing)

{{JCW-selected |SAGE Publishing |imprint1=Pion Ltd |doi1=10.1068 |doi2=10.1106 |doi3=10.1177 |doi4=10.1191 |doi5=10.1243 |doi6=10.1258 |doi7=10.1345 |doi8=10.1354 |doi9=10.1369 |doi10=10.1622 |doi11=10.1630 |doi12=10.2182 |doi13=10.2189 |doi14=10.2511 |doi15=10.2968 |doi16=10.3317 |doi17=10.3821 |doi18=10.4135 |doi19=10.4137 |doi20=10.4219 |doi21=10.5034 |doi22=10.5126 |doi23=10.5193 |doi24=10.5301 |doi25=10.5367 |doi26=10.7182 |doi27=10.17322 |doi28=10.31124}}

For Pion Ltd, this would basically be everything that points to Pion Ltd (in this case nothing, since it redirects to SAGE Publishing) or has |registrant=Pion Ltd (10.1068)

{{JCW-selected|Pion Ltd|parent1=SAGE Publishing|doi1=10.1068}}
  • For WP:JCW/PUB, it should do this automatically for all DOI redirects.
  • For WP:CITEWATCH, it should do this automatically only if there's a target/registrant match with a corresponding entry on WP:CITEWATCH/SETUP.

Headbomb {t · c · p · b} 08:49, 16 February 2020 (UTC)
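The selection rule proposed above can be sketched in Python as follows. It assumes the redirects in Category:Redirects from DOI prefixes have already been parsed into dictionaries with "prefix", "target" and an optional "registrant"; that representation and the function name are hypothetical, and only a few of the SAGE prefixes are included, with registrants left out where the discussion does not state them.

```python
def jcw_selected(publisher, redirects):
    """Build a {{JCW-selected}} call for a publisher from parsed DOI-prefix redirects."""
    dois, imprints, parents = set(), [], []
    for redirect in redirects:
        registrant = redirect.get("registrant")
        points_here = redirect["target"] == publisher
        registered_here = registrant == publisher
        if not (points_here or registered_here):
            continue
        dois.add(redirect["prefix"])
        # A different registrant on a redirect pointing here is listed as an imprint.
        if points_here and registrant and not registered_here and registrant not in imprints:
            imprints.append(registrant)
        # A redirect registered here but pointing elsewhere gives the parent.
        if registered_here and not points_here and redirect["target"] not in parents:
            parents.append(redirect["target"])
    parts = [publisher]
    parts += [f"imprint{i}={name}" for i, name in enumerate(imprints, 1)]
    parts += [f"parent{i}={name}" for i, name in enumerate(parents, 1)]
    # Sort numerically on the part after "10." so 10.7182 precedes 10.17322.
    ordered = sorted(dois, key=lambda prefix: int(prefix.split(".", 1)[1]))
    parts += [f"doi{i}={prefix}" for i, prefix in enumerate(ordered, 1)]
    return "{{JCW-selected|" + "|".join(parts) + "}}"


# Subset of the redirects discussed above; missing registrants are not stated in the thread.
redirects = [
    {"prefix": "10.17322", "target": "SAGE Publishing"},
    {"prefix": "10.1068", "target": "SAGE Publishing", "registrant": "Pion Ltd"},
    {"prefix": "10.1177", "target": "SAGE Publishing"},
]
print(jcw_selected("SAGE Publishing", redirects))
# {{JCW-selected|SAGE Publishing|imprint1=Pion Ltd|doi1=10.1068|doi2=10.1177|doi3=10.17322}}
print(jcw_selected("Pion Ltd", redirects))
# {{JCW-selected|Pion Ltd|parent1=SAGE Publishing|doi1=10.1068}}
```

For WP:CITEWATCH, the same function would presumably only be called for publishers that have a target/registrant match on WP:CITEWATCH/SETUP, as described in the second bullet.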

