Wikipedia talk:Copyright problems


A "copyvio" bungle on the Backup article, and what we should learn from that about CopyPatrol problems

Early in the morning of 29 November 2018 I wrote an edit of the "Live data" subsection of the "Backup" article, as a combination of material that was currently in the subsection, material that had been in the subsection until it was deleted by JohnInDC the previous day [1], and material that I had written from scratch early that morning. AFAICT the subsection had been previously basically unchanged since around 2008, so I was amazed to discover that User:Username_Needed had reverted my edit and placed a copyvio template on my Talk page about 2 hours later. I immediately protested "I am mystified ..." on a new section of Username Needed's Talk page, ending that comment with "Please tell me—or have your bot tell me—where the copyright violation lies, so that I can fix it."

(To preserve a comprehensive narrative, I have copied comments I originally made in that section of Username Needed's Talk page into this section of my own Talk page.)

About 18 hours after his/her reversion of my edit, User:Username_Needed reverted that reversion with the Edit Summary "SelfRev- misidentification". Still no comment from him/her on where the copyvio lay. Another 7 days later User:Username_Needed put on my Talk page a first feeble apology, which said "I had misidentified an edit as a copyvio when the actual material was elsewhere. Sorry for the confusion." I responded to that with "What does 'elsewhere' mean? In the 'Backup' article, but not in the 'Live data' subsection? In some other article?" My response went on to describe tests I had made with Earwig's Copyvio Detector before and after his/her original reversion of my edit, which showed a 15.3% probability of copyvio before the original reversion in a (properly quotemarked and footnoted) quote in a completely different section of the WP article, and showed a 16.7% probability of copyvio in the "Live data" subsection after the original reversion. Much of the "Live data" subsection appears to have originally been written based on the notes of a 1997 University of Wisconsin lecture by a database administrator; if that 16.7% probability represents a genuine copyvio, it's likely to be related to that (not quotemarked but properly footnoted) lecture note material—which the database administrator may have copied from Oracle Corp.'s 1997-era documentation.

User:Username_Needed then replied on my Talk page "I was using User:Crow's copypartol tool, which has quite a complicated interface, and I misread it. I have no idea if there ever was any copyvio, whether it has been removed and whether it is still in the revs." I replied on User_talk:Username_Needed#Backup "I just used that CopyPatrol tool to discover a "copyright violation", but it's in the reverse direction! [2] is [explicitly at the top of the blog] dated 10 April 2013, which by internal evidence is almost 5 years after the existing material I edited in the "Live data" subsection was written. If you use View History for the WP article, and go back to a version before that date, it will become pretty obvious that [a consultant in Los Angeles County] copied at least that subsection of the WP article without AFAICT crediting Wikipedia."

(In apparent reaction to a message I—perhaps naively—left on his voicemail the other day, the Los Angeles County consultant has now removed the copyright-violating material from the current version of his blog—but it remains captured on the Wayback Machine on 18 May 2014; I have substituted the Wayback URL in the link in the preceding paragraph.)

What User:Username_Needed's bungle reminds me of is something that happened while my father was teaching me to drive around our suburban neighborhood in 1956. Disregarding his warning I went around a small traffic circle too fast, and ended up going across someone's lawn. Having a bit more intestinal fortitude than User:Username_Needed, I apologized to my father as soon as—not 7 days after—he drove our car off the lawn and around the corner and told me to get back behind the wheel. Maybe WP editors should be required to go through a Learner's Permit phase under the supervision of another CopyPatrol-experienced editor before being allowed to use CopyPatrol to identify copyvios on their own.

Earwig's Copyvio Detector didn't detect the reverse copyvio, but that may simply be because Earwig's Copyvio Detector doesn't look at non-institutional blogs such as that of the Los Angeles County consultant. OTOH Earwig's Copyvio Detector may have coding to detect reverse copyvios and ignore them (comparing in only one direction and seeing which file creation date is earlier are two simple-minded techniques that occur to me); maybe CopyPatrol should have similar coding added. DovidBenAvraham (talk) 10:44, 12 December 2018 (UTC)

  • I will not be using the Crow's copyvio tool in the future, but could you please stop trying to insult me. I get that you're angry at being accused of something that you didn't do. This is justified, but throwing insults isn't going to help. Again, I apologise for upsetting you with an incorrect accusation, but it was a mistake. No malicious intent or upset was intentional. Also, I was offline for a while, hence why I did not respond immediately. [Username Needed] 10:52, 12 December 2018 (UTC)
You forget that I saw the results of your CopyPatrol run on the "Backup" article before those results were invalidated by the Los Angeles County consultant's deleting his reverse copyvio material. They showed 6-7 pages of text that was highlighted as being a copyvio. That IMHO would have alerted any WP editor experienced with copyvios that the copyvio might be in the reverse direction. Since the Los Angeles County consultant had explicitly put a 10 April 2013 date at the top of his blog post, it would have taken that experienced WP editor no more than 5 minutes to use View History to access a version of the "Backup" article written before 10 April 2013. 2 minutes of manually comparing text would then have convinced the experienced WP editor that the Los Angeles County consultant had copied from the WP article, rather than in the other direction. I think the foregoing thoroughly establishes your need for some Learner's Permit time; sorry you consider that an insult.
Next we must consider your delay of over 7 days in giving me any statement of what my alleged copyvio consisted of. The only conceivable answer you could have given from the CopyPatrol run is that I had copied essentially the entire "Backup" article into WP, and 2 minutes using View History would have convinced anyone that that answer was nonsensical. Do you have evidence that you asked any other WP editor for advice? No, you apparently just sat there with your mouth shut until I pressed you for an answer. And after 7 days you blamed your problem on the CopyPatrol interface. I think your lack of intestinal fortitude in this matter is pretty well demonstrated; you should have grown out of that in your teenage years. Heck, I just admitted in this section that my phoning the Los Angeles County consultant was an error in judgement on my part. We all have room to grow morally and intellectually. DovidBenAvraham (talk) 12:11, 12 December 2018 (UTC)
NB. I have had several conversations with this editor about WP:NPA, WP:Civility and their variants about 3/5 of the way into the section here. JohnInDC (talk) 12:19, 12 December 2018 (UTC)
It was not the article as a whole that insulted me, it was the line "Having a bit more intestinal fortitude than User:Username_Needed". I was not waiting to respond, I was merely offline. I also consider that bringing me personally into this conversation in the first place was continuing on with a mistake that already had been resolved. [Username Needed] 13:06, 12 December 2018 (UTC)
I agree with both points - that this is beating a dead horse (at needless length to boot), and that the complaint was gratuitously insulting. JohnInDC (talk) 17:11, 12 December 2018 (UTC)

Copied from my personal Talk page, with elisions of the Los Angeles County consultant's name. You'll see the reason for the copy below:

It doesn't violate copyright. He can use it freely. See WP:Copyrights. Probably all he needed to do was to add an attribution somewhere on his blog page - if that. I wonder - is this better, or worse, than making a mistake with an unfamiliar copyvio tool? JohnInDC (talk) 11:56, 12 December 2018 (UTC)
Except I didn't say in my voicemail message that [the Los Angeles County consultant] should delete the copied WP text. I just stated that there appeared to be a copyright violation, and gave my phone number. By no particular coincidence, yesterday I took a fast look at the appropriate WP article on copyright. I think you're correct that adding an attribution would have been sufficient, probably accompanied by adding back those few citations that were in the WP article as of 10 April 2013. But then that wouldn't have showed off [the Los Angeles County consultant]'s consultative brilliance to the same extent. Should I have just said nothing?
As far as "making a mistake with an unfamiliar copyvio tool", I've replied to Username Needed's complaint of having been insulted here. I didn't mention my temporary decision to quit editing Wikipedia for at least 3 months, made before Username Needed reverted his/her reversion. Thank the Lord I didn't hit anyone rolling across that lawn in 1956. DovidBenAvraham (talk) 12:55, 12 December 2018 (UTC)
Yes, you should have said nothing. Because you don’t know what you are doing, you left a misleading message on a private party’s voicemail claiming a “copyright violation“ on material which with only the most minor of adjustments, he is fully entitled to use, and caused him to take an action that he didn’t need to take. The effect of your dispute here is now rippling outside of Wikipedia. You are making things worse, not better. JohnInDC (talk) 13:12, 12 December 2018 (UTC)
I'll concede that Username Needed's intestinal fortitude may be the equal of that of Luke Skywalker or Lara Croft. But that concession means we would have to find another reason for the 18-hour delay in reverting his/her reversion, and the further 7-day delay in providing an explanation that proved to be inadequate. I hereby propose that, in searching for copyright violations in a great many WP articles, Username Needed is stretching himself/herself too thin. IMHO Username Needed has admitted above that he/she should do less WP editing, and do it more thoroughly.
If JohnInDC works his way through the links in the Wayback Machine version of [the Los Angeles County consultant]'s Web page, he will find this page that also includes an 18 April 2013 article entitled "Viruses, Trojans and Malware in general". A Google search on the phrase that consists of the first 6 words of that article led me to this page which—in an updated version—still exists on Cisco's website. Truly "The wicked flee when no man pursueth ...", but someone connected with Cisco—which may have insisted on his/her adding more than an attribution—seems to have pursued the Los Angeles County consultant about that part of the Web page. DovidBenAvraham (talk) 18:35, 13 December 2018 (UTC)

DovidBenAvraham (talk) 19:04, 13 December 2018 (UTC)

  • I will say again. This horse is dead. Also wikilinking my name pings me. I am getting 3-4 pings every time you make an edit. As explained above, I was offwiki for 18 hours, then briefly came back on, noticed that I had made an incorrect edit, and reverted it, lacking the insight to realise that you wanted an explanation, until you elaborated further. Sometimes people make mistakes, and then do not realise that they have made them until a day later. 09:01, 14 December 2018 (UTC)

The reason for the copy above is that CopyPatrol raises several policy questions for the "WP copyvio detection squad", which is my term for the editors who detect and deal with copyvios:

  • Should CopyPatrol now be used by the "WP copyvio detection squad"? 
    Not at all, IMHO, until both its GUI and its user documentation are greatly improved. CopyPatrol, whose "seekrit" documentation is here, is AFAICT an apparently new combination of Turnitin and the database in User:EranBot. Trigger warning: The rest of this paragraph will almost certainly cause the reader to slap his/her forehead, and may cause him/her to roll on the floor laughing/crying. The "Backwards copies" paragraph in "Usage" starts out with "Note also the possibility of backwards copies, where the source appeared to copy content from Wikipedia. This is not necessarily a false positive ...", which does not really disclose that CopyPatrol shows reverse copyvios (at least more often than Earwig's Copyvio Detector)—which IMHO is what really confused Username Needed on 29 November. Also, because CopyPatrol—to avoid repeats of lengthy compare processing—maintains a database containing the results for every WP article already compared, there appears to be no way in the GUI to initiate a rerun of an article already checked; that database already contains the results of compares made in October 2016—probably with Earwig's Copyvio Detector rather than CopyPatrol—for the "Retrospect (software)" WP article. (This makes sense for the Turnitin commercial product, because high school and college students must at the least re-submit—presumably as a different document—any paper that has been previously flagged for plagiarism.)
  • After CopyPatrol has its GUI and user documentation fixed, should it be used by the "WP copyvio detection squad"? 
    Not unless each squad member allowed to use it has gone through supervised "Learner's Permit" training that teaches how to distinguish between an outside-article-to-WP-article "normal" copyvio and a WP-article-to-outside-article "reverse" copyvio. Moreover, when Username Needed told me on 7 December that he/she had used CopyPatrol on 29 November, I was able to recognize the "reverse" copyvio because of three special circumstances: (1) The Los Angeles County consultant had (stupidly IMHO) inserted a 10 April 2013 date at the top of his copy. (2) I knew that the first 7 pages of the WP "Backup" article—highlighted by the CopyPatrol GUI—had not previously been substantially edited since 2011. (3) I was willing to spend some time on this, because it was allegedly my copyvio. Those circumstances combined to enable me to identify a "reverse" copyvio, but a member of the "WP copyvio detection squad" would be unlikely to encounter them again. So in addition at a minimum CopyPatrol needs a facility to report the File Creation Date of the outside document being compared to the WP article (whose Date Added can be approximated from View History), but I don't know enough about the details of Turnitin to know if that's feasible for a non-WP Web document.
  • Once CopyPatrol is ready to be used by a re-trained "WP copyvio detection squad", should ordinary "squad" members—or anyone—be allowed to communicate with perpetrators of "reverse" copyvios (which CopyPatrol will detect a lot of)? 
    Not ordinary members, because the experience reported in my 12:55, 12 December 2018 (UTC) comment shows that that communication requires a more delicate touch than I—for one—possess. IMHO an ordinary "WP copyvio detection squad" member, upon discovering an apparent "reverse" copyvio, must communicate that fact to a supervisory "squad" member having further specialized training and HR talent. However the supervisory squad member should not hesitate to communicate with the perpetrator, because AFAIK (IANAL) a years-long history of failing to pursue "reverse" copyvios may eventually result in a judicial decision that Wikipedia has lost its right to insist upon attribution. DovidBenAvraham (talk) 04:27, 14 December 2018 (UTC)
  • I'm trying to make sense of the wall of text above. It looks like somebody misidentified something as a copyvio when it was in fact a reverse copyvio and fixed the mistake and apologised when this was pointed out to them. In which case there isn't much to discuss here. People are human and make mistakes from time to time, the only thing we can do in that situation is fix the mistake and apologise. Every automated copyvio detection tool is only an aid to human judgement, the results should never be used blindly (the way you're quoting Earwig confidence percentages suggests you're doing this as well). You've now graduated to making very elaborate proposals involving the creation of something called the "WP Copyvio Detection Squad", several ridiculously bureaucratic proposals which will basically ensure nobody does any copyvio detection work, and technical proposals which aren't very workable. There's no need to spend so long harping on over a simple mistake. Hut 8.5 07:57, 14 December 2018 (UTC)
I've watched this entire episode unfold in real time. I wasn't able to keep it from unspooling further, but I couldn't have summarized it better than that. Thanks and well done. Now perhaps we can be done with it. JohnInDC (talk) 11:46, 14 December 2018 (UTC)
I hate to drag this out any further, but I have to object to post-facto revisions of Talk page comments such as this which render later editors' comments cryptic or nonsensical. See Wikipedia:REDACT. JohnInDC (talk) 17:16, 14 December 2018 (UTC)
First, I apologize for the post-facto revisions of my 04:27, 14 December 2018 (UTC) comment. I took to heart Hut 8.5's criticism of the style of that comment, especially since the style had evidently led to his misunderstanding it. If you want drama you can read about it here, but my major revision was to change WP Copyvio Detection Squad—a joke that went over Hut 8.5's head—to "WP copyvio detection squad". I'm not advocating the formation of such a squad, but saying that in fact it already exists—with Hut 8.5 and Username Needed being prominent members. Other than that, my only other substantive edits were adding "(3) I was willing to spend some time on this, because it was allegedly my copyvio" to the second substantive paragraph, adding "(at least more often ...)" and "(which CopyPatrol will detect a lot of)" as parenthetical clarifications, and deleting "Having added ..., and having started edits ...,,"—which I realized were unnecessary clauses that contributed to the "wall of text" impression Hut 8.5 complained of. DovidBenAvraham (talk) 07:58, 15 December 2018 (UTC)

And now let me correct Hut 8.5's mis-perception of what I have been saying in this—retitled by me—section. Although I did initially throw in some criticism of Username Needed, I came to realize that he/she had been sandbagged by trying to use CopyPatrol. Read this section of his/her personal Talk page; it clearly shows that even 7 days later it required me to identify what CopyPatrol had found as a reverse copyvio. That was because I took the time to look at it; being a member of the "squad" requires one to take as little time as possible dealing with an individual possible copyvio, in order to keep up with the ceaseless flow of WP edits. IMHO it took considerable intestinal fortitude for Username Needed to say in his/her 10:52, 12 December 2018 (UTC) comment above in this section "I will not be using the Crow's copyvio tool in the future", considering how much emotional investment members of the "squad" such as CrowCaw and Hut 8.5 have in CopyPatrol.

Here's an executive summary of my 04:27, 14 December 2018 (UTC) "wall of text" comment:

  • CopyPatrol is IMHO currently a piece of c**p, which nobody should be using until its GUI and user documentation are fixed.
  • Because CopyPatrol evidently identifies a lot more reverse copyvios than Earwig's Copyvio Detector, members of the "squad" using CopyPatrol after the required fixes are IMHO going to need additional training and a further enhancement to CopyPatrol.
  • Once an ordinary "squad" member uses a fixed and enhanced CopyPatrol to identify a reverse copyvio, IME he/she will need to turn to a "supervisory" member of the "squad" to perform the delicate task of communicating with the violator (who will likely not be a WP editor). An alternative would be to simply ignore the reverse copyvio, but IMHO a history of such ignorings could eventually result in a court decision that anybody can freely copy a WP article without attribution.

OK, I can read signs, and the one at the top of this page translates as "This page is the baseball dugout for members of the 'squad'; if you're not a member, you need to go elsewhere." Just think of me as a retired applications programmer who wandered into the dugout looking for his lost baseball and stayed to express some maybe-appropriate opinions on why it got lost; I can find my own way out. DovidBenAvraham (talk) 10:13, 15 December 2018 (UTC)

  • Seriously, drop it. Imagine what you'd do if you found out someone had made a spelling mistake in an article. You'd fix it, or point out to that person that there was a spelling mistake, and that would be it. You certainly wouldn't write thousands of words criticising the person who made the spelling mistake, complaining that the person didn't spot their own spelling mistake, railing against people who change article contents for allowing spelling mistakes to be introduced, insisting that nobody be allowed to edit Wikipedia before the GUI is changed to include an automatic spell checker, and requiring that every Wikipedia editor go on a spelling course. I was going to write another paragraph explaining that the "WP Copyvio Detection Squad" doesn't exist and why your suggested technical changes are impossible or won't work, but honestly there's no point. Hut 8.5 11:18, 15 December 2018 (UTC)

What we should learn about CopyPatrol problems from the Backup "copyvio" incident

An invalid analogy and a mis-characterization of proposed solutions are not going to help solve the evident problems with using CopyPatrol as it presently exists. I've therefore started a new section on this page that is meant only to refer as far back as my 04:27, 14 December 2018 (UTC) comment in the immediately-preceding section.

  • Invalid analogy 
    If I had made a spelling mistake in an article, I would not have been given a template warning on my personal Talk page with a next-to-last sentence reading "Wikipedia takes spelling mistakes very seriously and persistent violators of our spelling mistake policy will be blocked from editing."
  • Mis-characterization of proposed solutions 
    I didn't insist that nobody be allowed to edit Wikipedia [my emphasis] before CopyPatrol is changed to include a facility to report the File Creation Date of the outside document being compared to the WP article, and requiring that every Wikipedia editor [my emphasis] go on a CopyPatrol course. I said "at a minimum CopyPatrol needs a facility to report the File Creation Date of the outside document being compared to the WP article (whose Date Added can be approximated from View History), but I don't know enough about the details of Turnitin to know if that's feasible for a non-WP Web document." I also said "Not unless each squad member allowed to use [CopyPatrol] [my emphasis] has gone through supervised 'Learner's Permit' training that teaches how to distinguish between an outside-article-to-WP-article 'normal' copyvio and a WP-article-to-outside-article 'reverse' copyvio."
  • What I'm actually proposing as a minimum software solution 
    What if the Los Angeles County consultant, instead of deleting his 10 April 2013 copy of the Backup article, had phoned me back and said "I wrote that material first, and a Wikipedia editor copied it; delete Wikipedia's article as a copyright violation."? How would I prove that he was lying? CopyPatrol seems to be basically a new version of Earwig's Copyvio Detector that automatically turns on the Use Turnitin option. The Turnitin software "checks for potentially unoriginal content by comparing submitted papers to several databases using a proprietary algorithm. It scans its own databases, and also has licensing agreements with large academic proprietary databases." If Turnitin's own databases include the date a compared-to document was first entered into a database, having that date displayed would be a very convenient way for a member of the "squad" to distinguish between an outside-article-to-WP-article 'normal' copyvio and a WP-article-to-outside-article 'reverse' copyvio. If that's not possible, then the "squad" member examining the CopyPatrol result would have to do what I did, which is to use the Wayback Machine to approximate the date the outside article was first created. It takes time to do that, so it would be nice if CopyPatrol could automate that process. If that automation also turns out to be infeasible, would a member of the "squad" have the time to determine the direction of the copyvio—and if not why should he/she bother using CopyPatrol at all?
  • What I'm actually proposing as a minimum training solution 
    In my 10:44, 12 December 2018 (UTC) comment, I quoted the "squad" member who reverted and then un-reverted the alleged copyvio as saying "I was using User:Crow's copypartol [sic] tool, which has quite a complicated interface, and I misread it. I have no idea if there ever was any copyvio, whether it has been removed and whether it is still in the revs." That makes it obvious that the CopyPatrol GUI needs to be improved and given proper user documentation. It also makes it obvious that every "squad" member using CopyPatrol needs to be given sufficient training to correctly identify a "normal" copyvio and a "reverse" copyvio. If the copyvio is identified as a "normal" one, then the "squad" member can presumably be trusted to follow the same procedure the "squad" member originally used for my 29 November edit. If OTOH the copyvio is identified as a "reverse" one, then some member of the "squad" has to be trained to properly communicate with the violator—who is likely not to be a Wikipedia editor. If that training can't be done, then why bother to use CopyPatrol—which because it automatically uses Turnitin has a greater likelihood of detecting "reverse" copyvios than Earwig's Copyvio Detector? DovidBenAvraham (talk) 05:54, 19 December 2018 (UTC)
I think it's sufficient if we remind editors to do their very best to try to avoid mistakes, to be gracious and understanding when their mistakes are pointed out to them - and, to be equally gracious and understanding on the occasions they are drawn in by the good faith mistakes of others. JohnInDC (talk) 12:29, 19 December 2018 (UTC)
Indeed. Hut 8.5 21:47, 19 December 2018 (UTC)

For simplicity, my comment beginning this subsection left out one item that was in the Should CopyPatrol now be used by the "WP copyvio detection squad"? paragraph of my 04:27, 14 December 2018 (UTC) comment: "Also, because CopyPatrol—to avoid repeats of lengthy compare processing—maintains a database containing the results for every WP article already compared, there appears to be no way in the GUI to initiate a rerun of an article already checked; that database already contains the results of compares made in October 2016—probably with Earwig's Copyvio Detector rather than CopyPatrol—for the 'Retrospect (software)' WP article. (This makes sense for the Turnitin commercial product, because high school and college students must at the least re-submit—presumably as a different document—any paper that has been previously flagged for plagiarism.)" Have I missed some way of initiating a CopyPatrol rerun for a WP article that has already been processed? DovidBenAvraham (talk) 21:55, 23 December 2018 (UTC)

DovidBenAvraham, would you please stop editing your posts after you've made them? – if you can't get your quotation marks right the first time, please either leave them out or leave them alone. For the rest of it, backwards copying is a common occurrence, and is adequately handled when it sends an alarm message. This is a volunteer project: there isn't a squad, there isn't any training, and I'm pretty sure there's no value in prolonging this discussion. Could you please just drop it now? Justlettersandnumbers (talk) 12:06, 24 December 2018 (UTC)

Justlettersandnumbers and others: I'm not ready to drop this matter yet, because you have collectively failed to acknowledge that the only way my alleged 29 November copyvio was "adequately handled" was that the volunteer who alleged it bravely realized he/she was in over his/her head and accepted my word that I hadn't made a copyvio. With additional information about the faulty diagnosis supplied 7 days later, I—not that volunteer—determined that the copyvio was in fact backwards copying. I've made a couple of suggestions—based on my 40 years of experience as an applications programmer—that would make such a determination easier for a volunteer, but nobody has been willing to soberly discuss the feasibility of those suggestions.

Moreover, nobody has been willing to discuss the fact that "When issues have been resolved in CopyPatrol, the indicator will not disappear from the feed. Once a page has been flagged for potential copyvio in the feed, it will stay that way." That means that I could phone the Los Angeles County consultant and tell him that it's now safe to copy any of the contents of the "Backup" article back into the pages on his promotional website without attribution, because no WP editor can use CopyPatrol to check that WP article a second time. Of course I'm not going to make that phonecall, because a lot of people—including me—have put effort into the article over many years. Let me point out that I have made 196 edits to that article since I started on 15 November 2017, and for none of them—except for the one on 29 November—was I notified of a backwards copyvio that had evidently existed since 10 April 2013. So Earwig's Copyvio Detector doesn't detect many backwards copyvios when it uses Google instead of Turnitin, and—like the legendary piano tuner Oppornockity—CopyPatrol "tunes but once". How about dealing with that problem, instead of berating me for a trivial grammatical edit to my own comment?

This collective behavior on your part is IMHO a prime illustration of "One motivation for the project is a significant decline in the number of people considered active contributors to the flagship English-language Wikipedia: it has fallen by 40 percent over the past eight years, to about 30,000. Research indicates that the problem is rooted in Wikipedians’ complex bureaucracy and their often hard-line responses to newcomers’ mistakes, enabled by semi-automated tools that make deleting new changes easy." The same three-year-old MIT Technology Review article goes on to quote Aaron Halfaker, a senior research scientist at Wikimedia Foundation, as saying "I suspect the aggressive behavior of Wikipedians doing quality control is because they’re making judgments really fast and they’re not encouraged to have a human interaction with the person".

You volunteers may not like the term "squad" but you are one; I came across a Leaderboard for you the other day, with Diannaa at the top. In the U.S. volunteer fire fighters "go through some or all of the same training as career personnel do". I wish all of you a Happy New Year with better tools, better training, more time, and less of a "drag the wagons into a circle" attitude. DovidBenAvraham (talk) 23:40, 25 December 2018 (UTC)

Hello, Diannaa here. I am going to try to answer a couple of your questions. The CopyPatrol interface uses a subscription to Turnitin, which produces an iThenticate link on each CopyPatrol report. Clicking on the iThenticate link takes you to the report generated by that external service, which donates their material to Wikipedia. Each item listed in the "match overview" box has a "crawled on" date which is the date the matching page was detected by the iThenticate service. If you've got a failproof way of determining the creation date of a webpage, I'd appreciate knowing what it is, because even with several tools at my disposal I am not always able to answer that question. Checking as to who had content first is often but not always possible. The Wikipedia history page can tell us when a particular piece of content was added, but it's not always obvious where it came from, particularly if it was copied or moved from elsewhere on Wikipedia without the required attribution. I do lots of CopyPatrol reports every day and can usually but not always spot when material has been copied from another article or from an old revision of the same article. Mistakes will happen from time to time.
There was a suggestion that we should prevent totally new editors from using the CopyPatrol interface; it is still an open ticket, considered low priority: phab:T178700. Since the user who made the error has 3300 edits in a year and a half on Wikipedia, they would not have been stopped from attempting to help at CopyPatrol by such a barrier anyway. I do check the leaderboard and see who's been using the interface to confirm that they are experienced editors, and may perform spot checks on their work, but (like every area of Wikipedia) there is no formal training required.
You seem to imply that Wikipedia editors should take responsibility to contact people who have re-used our content without providing the required attribution. As license holders of the content we have written, we are within our rights to do so, but it's not something I personally would do, as I don't have time, and virtually every bit of material on our wiki has been reproduced in multiple locations. I have no interest in trying to police the whole Internet, content to spend my online time cleaning and maintaining Wikipedia itself.
You offered a link to this discussion and have misunderstood what is being discussed there. What they are talking about is a link to CopyPatrol on the New Pages Feed. This is New Pages Feed. What they are saying is that there's a new feature at New Pages Feed: a little note on the bottom right that offers a link to the CopyPatrol report if one exists. The material you quoted ("When issues have been resolved in CopyPatrol, the indicator will not disappear from the feed. Once a page has been flagged for potential copyvio in the feed, it will stay that way") means that the link at New Page Patrol will not be removed if the CopyPatrol report has been resolved. It doesn't mean that it's not possible for CopyPatrol or iThenticate to check against that website a second time. Sorry you had a bad experience. — Diannaa 🍁 (talk) 16:14, 26 December 2018 (UTC)
Thanks for your helpful response, Diannaa. With regard to the Turnitin/iThenticate service, if that service's "crawled on" date is merely the date the matching page was detected in a WP copyvio search, then that date wouldn't be useful in determining the direction of a copyvio. I was hoping that iThenticate "pre-emptively crawled" the Web. There is a service that "pre-emptively crawls" the Web; it is known as the Wayback Machine. As I said in a comment some distance above, I was able to use the Wayback Machine to approximate the date the Los Angeles County consultant created his non-WP promotional Web pages—which was years after the WP article he had copied without attribution was written. That exercise took me a few minutes of time, which members of the "squad" don't seem to have; I am still hoping that CopyPatrol could be enhanced to do that Wayback Machine lookup automatically for non-WP Web pages preliminarily identified as being involved in a copyvio.
As I've said in a comment some distance above, I personally left a voicemail message for the Los Angeles County consultant. That was sufficient for him to preemptively remove the WP article material from his promotional Web pages, but—as JohnInDC has reminded me—I shouldn't have done that without including in the message the fact that adding an attribution to WP would make the copying legal. I understand that members of the "squad" don't have the time to go after "reverse" copyvios properly, but—as a WP editor who creates new content—I would hate to have the meme spread around that "you can copy any Wikipedia article without attribution; WP won't go after you".
Maybe it's possible to get CopyPatrol to check against a specific WP article a second time, but I can't figure out how to do it. For my own infrequent use, can you please tell me how? DovidBenAvraham (talk) 19:31, 26 December 2018 (UTC)
The CopyPatrol tool is a bot that checks all additions over a certain size using the donated Turnitin service. There is no way to ask that particular bot to check a specific Wikipedia article. The way to check a specific Wikipedia article is to use Earwig's copyvio detector. It does not use the Turnitin service, but rather checks using a Google search engine, for which Google donates to us 10,000 free daily queries. There are advantages and disadvantages to each. For example, Turnitin occasionally finds material that is no longer present on the Internet and was never archived by the Wayback Machine, and Earwig's tool will check material already present in the article, not just new additions. — Diannaa 🍁 (talk) 20:31, 26 December 2018 (UTC)
Thank you again, Diannaa. Forgive me for pursuing this incident, but 40 years in the profession from which I am retired makes me want to understand any system problems I encounter—including procedural problems as well as programming problems. I think what you're saying is that the Web page I linked to gives only the results of CopyPatrol runs on WP articles, and doesn't allow anyone to initiate a CopyPatrol run. That would explain why I couldn't figure out how to get CopyPatrol to check against a specific WP article a second time. I think what you're also implying is that there is some super-secret process that automatically runs CopyPatrol against every newly-written WP article, at a point in time when there is no possibility of a "reverse" copyvio. So there's in fact no need to enhance CopyPatrol to enable a member of the "squad" to diagnose a "reverse" copyvio, because he/she will never encounter one while using CopyPatrol results.
There's only one problem with this conclusion, which is that somehow on 29 November Username_Needed was required to analyze the results of a CopyPatrol run on the "Backup" article—an article which has been somewhat steadily expanded since it was created in 2004. I know Username_Needed hates to get a message that he/she has been mentioned in connection with what I have called this "bungle", but I've ensured he/she will get a message so that he/she will be motivated to explain here how and by whom that CopyPatrol run was made. DovidBenAvraham (talk) 03:07, 27 December 2018 (UTC)
(1) CopyPatrol assesses every edit over a certain size, not just page creations. Therefore reverse copyvios and Wikipedia mirrors are indeed encountered. (2) Username_Needed did not initiate a bot run; the bot runs endlessly (or until a problem occurs; it has currently been running flawlessly for about three months non-stop). It checks all edits over the size limit, and reports its findings at Interested patrollers visit the interface daily and attempt to determine whether or not a violation has occurred for each reported item. — Diannaa 🍁 (talk) 03:45, 27 December 2018 (UTC)
OK, so everything I said in the first two paragraphs of my 19:31, 26 December 2018 (UTC) comment turns out to be still valid. Sorry to have misread Diannaa's 20:31, 26 December 2018 (UTC) comment, folks; Username_Needed need not give any explanation. So how about discussing those first two paragraphs as part of a rational systems analysis of CopyPatrol? DovidBenAvraham (talk) 04:45, 27 December 2018 (UTC)
Thank you for your suggestions and comments. — Diannaa 🍁 (talk) 12:12, 27 December 2018 (UTC)
Actually what I said in the third paragraph of my 19:31, 26 December 2018 (UTC) comment turns out also to be still valid, provided you interpret "I can't figure out how to do it" as meaning "it's impossible for an editor to do it (other than by making an edit that's over the size limit per Diannaa)". DovidBenAvraham (talk) 13:01, 27 December 2018 (UTC)
Now let me discuss the third paragraph of my 19:31, 26 December 2018 (UTC) comment, as part of a rational systems analysis of CopyPatrol. I'll take a real-life example: the copyvios I committed that were caught by Earwig's Copyvio Detector in Report 24898397 for text added to "Retrospect (software)" at 17:12, 06 October 2016 (UTC). What I had done, as Diannaa spotted, was to copy overly-long passages (even though I properly quote-marked and referenced them) from two Retrospect Inc. User's Guides. Within a day or two I had paraphrased the offending passages. Let's assume that those User's Guides could not have been found via a Google search, so that CopyPatrol would have been needed to find the copyvios.
If either Diannaa or I had then wanted to verify that my paraphrasing was satisfactory, how could we have done so? IIRC I did not make "additions over a certain size" (which I think from the "Backup" View History must be approximately 1KB), but simply reworded the offending passages. If so, CopyPatrol would not have re-analyzed the article. Does using Earwig's Copyvio Detector, specifying a Revision ID and check-marking Use Turnitin, actually now do the re-analysis we would have wanted? If not, there's a gaping hole in the "squad"'s software + procedures for being able to follow up on detected "normal" (AKA "forward") copyvios.
I'll let you folks follow up on that question. I've contributed what I can, so I'm leaving this discussion. DovidBenAvraham (talk) 04:36, 28 December 2018 (UTC)
I need to make a correction to the second paragraph of my 04:36, 28 December 2018 (UTC) comment. Belatedly looking at the Revision History for Retrospect (software)—rather than relying on my own memory—I see that I made 6 edits from 16 October 2016 to 18 October 2016 that each were over 1KB. Therefore I presume that CopyPatrol would have automatically run again daily during those 3 days. My edits added back rewritten sections that Diannaa had deleted several days previously because of excessive quotations.
I still think that the second paragraph of my 04:36, 28 December 2018 (UTC) comment discloses a possible hole in the "squad"'s software + procedures for being able to detect—or follow up on detected—"normal" (AKA "forward") copyvios. What if I now, knowing what I know about when CopyPatrol is run for an article, added back the sections as they were originally written—complete with excessive quotations—in small-enough edits that CopyPatrol would not be triggered?
BTW, I notice that certain edits in that Revision History have their size-in-bytes bolded. Even though some of those bolded sizes are somewhat less than 1KB, is that bolding a tipoff that the edits would cause CopyPatrol to be run on them? DovidBenAvraham (talk) 03:12, 29 December 2018 (UTC)

With regard to the first paragraph of my 19:31, 26 December 2018 (UTC) comment, iThenticate's own FAQ Web page, in the "Against what resources will my manuscript be compared?" paragraph, says "iThenticate also maintains its own web crawler, indexing over 10 million web pages daily and totalling over 50 billion web pages." Since iThenticate thus "pre-emptively crawls" the Web, it's likely that company could supply the date a particular non-WP Web page was first crawled. As I said in that paragraph, having that date would be a great time-saver for a "squad" member needing to determine whether a detected copyvio was in the "normal" or "backwards" direction. Diannaa, don't you think some "squad" leader should ask iThenticate if it can supply those dates? DovidBenAvraham (talk) 04:03, 2 January 2019 (UTC)

With regard to the "What I'm actually proposing as a minimum training solution" paragraph in my comment beginning this sub-section, I forgot to mention one obvious training deficiency—which is also an obvious software deficiency in Template:uw-copyright. As I mentioned in the section to which this sub-section belongs, Username_Needed did not give the URL of the document I was supposed to have copied. There is no parameter in Template:uw-copyright for the URL of the document, and Username_Needed did not include the URL in the Edit Summary for his/her revert. Therefore for 7 days I was left totally in the dark about the nature of my alleged copyvio. And it was only because Username_Needed—after 7 days—gave me a link to the CopyPatrol listing that I was able to see the URL and determine that the copyvio was in fact in the backwards direction. Therefore I propose that (1) every new member of the "squad" should have one thing drilled into his/her head: "Give the URL in the Edit Summary!" and (2) Diannaa should have Template:uw-copyright modified to include a URL parameter. Based on my own experience, I'd say there is a 98% chance that any WP editor falsely accused of a copyvio will simply stop editing Wikipedia—unless that editor is determined to keep putting self-serving material into WP articles. So unless these two solutions are implemented, my prediction is that Wikipedia will continue its transition into a less-disciplined version of Facebook—which is IMHO a major reason the non-editing public has little respect for the content of WP articles. And AFAIK Facebook (which I don't use) has no position for volunteer "squaddies"—so you'll have to find something else to do with your spare time. DovidBenAvraham (talk) 11:56, 8 February 2019 (UTC)

What happens when an author wants to have his own article published

I hope to head this off at the pass before I start creating a draft.

I've been approached by a Michael M Wood, who is the author of this article on the Kentucky Historical Society website; he would like to see this article published. The man is apparently a notable, but I do believe that he is indeed notable, and with his permission an article could be created with extensive copy editing, sans the genealogy stuff. If Michael M Wood grants permission, can I do a copy edit, quoting liberally? Wiki Commons has a permission slip (OTRS) for images and the link; is there something similar for mainspace? Thanks. Oldperson (talk) 22:56, 6 February 2019 (UTC)

See WP:IOWN for the procedures for a copyright owner to grant us the rights to use content. Hut 8.5 07:38, 7 February 2019 (UTC)
Hut, thanks, but it is not I who is the author. The author has an article on the Kentucky Historical Society webpage and is interested in having that article published. Your link takes me to an explanation that would apply were I the author trying to publish an article. This is not the case. Oldperson (talk) 17:05, 8 February 2019 (UTC)
Sure, but the same principles apply. If the author wants the article to be published here as an article and not deleted as a copyright violation then either he can post a notice at the source releasing it under the appropriate licences or he can email OTRS as described in the link. Hut 8.5 18:25, 8 February 2019 (UTC)

Input requested in discussion regarding links to potential copyright violations

There is a discussion at WT:C#CiteSeerX copyrights and linking regarding external links to possible copyright violations (copies of academic papers on CiteSeerX, of which some fraction are not properly licensed / inappropriately uploaded) in the context of bot-assisted additions of such links to citation templates. The discussion could use both wider input in general, and input specifically from editors who are familiar with Wikipedia's policies on copyright and have experience in the practical application of them. If you believe there are other talk pages which could profitably be solicited for input on this I would appreciate a pointer. --Xover (talk) 17:30, 19 March 2019 (UTC)


RfC on copyright status of

There is an RfC on the copyright status of, which appears to republish text from Gale's International Directory of Company Histories. If you're interested, please participate at WP:RSN § Rfc: — Newslinger talk 14:20, 1 April 2019 (UTC)

Discussion on copyright status of

There is a discussion on the copyright status of, which appears to republish text from various Gale publications. If you're interested, please participate at Wikipedia:Reliable sources/Noticeboard § — Newslinger talk 17:30, 10 April 2019 (UTC)

Can I be promoted to a CCI clerk?

As some may know, I've been active in this area and have been cleaning up reports. I even finished two, Wikipedia:Contributor copyright investigations/BigButterfly and Wikipedia:Contributor copyright investigations/BeeCeePhoto. However, I can only mark them as completed and am unable to archive them, because only clerks are able to do so, and only one clerk (Lazygas) is currently active, and they seem to be busy with other stuff. Thus, I am asking the regulars here if they believe I meet the qualifications to become a clerk. I do not have any sort of past with copyright issues, and I believe I am knowledgeable enough in the field of copyvio to be qualified for the tools. Moved from the CCI talk page to here because it seems far more active. 💵Money💵emoji💵💸 20:24, 4 May 2019 (UTC) Note: (Pinging @Sphilbrick:, @Diannaa:, and @Justlettersandnumbers:) 💵Money💵emoji💵💸 13:39, 11 May 2019 (UTC)

Since there's been no response to your inquiry, I've gone ahead and archived the 4 completed cases for you. Wizardman used to look after the clerking, but he's now retired or semi-retired from Wikipedia. — Diannaa 🍁 (talk) 16:21, 25 May 2019 (UTC)
P.S. Thank you very much for your work at CCI. Appreciated — Diannaa 🍁 (talk) 16:23, 25 May 2019 (UTC)