[AI Talent Training Camp] English News Article Summary Generation

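Daily Mail News Article Summarization

Project Background

With the rapid growth of the internet and social media, news articles appear in endless supply, and readers facing this volume of information struggle to find the stories that interest them. This training competition aims to build a high-quality summarization model: attaching a short summary to each news document helps readers grasp its content quickly.

Project Task

Using real news articles and machine learning techniques, build an effective summarization model that generates a content summary for each news document.

Project Overview

English news dataset: https://aistudio.baidu.com/aistudio/datasetdetail/143475

Training dataset format: ID\tcontent\tsummary

Test dataset format: ID\tcontent

Predictions for the test dataset must be saved as: ID\tsummary

The T5-Base model is fine-tuned and evaluated with the ROUGE-L score.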
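As a rough illustration of the two tab-separated formats above, the following sketch parses one training line and builds one submission line. The helper names parse_train_line and format_submit_line are hypothetical and do not appear in the original project; collapsing whitespace inside the summary is an added assumption to keep each record on a single line.

```python
# Illustrative helpers for the tab-separated formats described above.
# Both function names are hypothetical, not part of the original notebook.

def parse_train_line(line):
    """Split one training line into its ID, content and summary fields."""
    record_id, content, summary = line.rstrip("\n").split("\t", 2)
    return {
        "id": record_id.strip(),
        "content": content.strip(),
        "summary": summary.strip(),
    }


def format_submit_line(record_id, summary):
    """Build one submission line: the ID, a tab, the summary, a newline."""
    # Assumption: tabs or newlines inside a summary would break the format,
    # so all runs of whitespace are collapsed to single spaces.
    clean = " ".join(summary.split())
    return f"{record_id}\t{clean}\n"


sample = "0\tsome article text .\tshort summary .\n"
record = parse_train_line(sample)
print(record["id"], "|", record["summary"])
print(repr(format_submit_line(record["id"], record["summary"])))
```

Splitting with a maximum of two splits keeps any stray tab inside the last field from producing extra columns.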

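1. Installing dependencies

In [ ]

# Developed with paddlenlp==2.3.3
!pip install paddlenlp==2.3.3

In [ ]

# A third-party package is used to compute the ROUGE-L score
!pip install rouge

In [3]

# Demo: computing a ROUGE-L score with the third-party package
from rouge import Rouge

rouge = Rouge()
# get_scores(hypothesis, reference); 'p' is the precision component of ROUGE-L
rouge_scores = rouge.get_scores("Installing collected packages", "Installing ")
rl_p = rouge_scores[0]['rouge-l']['p']
print("rouge_scores rl_p", rl_p)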

rouge_scores rl_p 0.3333333333333333
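For intuition about what the 'p' value above measures: ROUGE-L is based on the longest common subsequence (LCS) of words shared by a candidate and a reference. Below is a minimal sketch assuming whitespace tokenization. The rouge package and paddlenlp.metrics.RougeL may differ in tokenization and in how they combine precision and recall, so the function names and the F formula here are illustrative only.

```python
# Minimal ROUGE-L sketch based on the longest common subsequence (LCS).
# lcs_length and rouge_l are illustrative names, not a library API.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(hypothesis, reference, beta=1.0):
    """Return ROUGE-L precision, recall and F for whitespace-split text."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return {"p": 0.0, "r": 0.0, "f": 0.0}
    lcs = lcs_length(hyp, ref)
    p = lcs / len(hyp)  # share of candidate words that lie on the LCS
    r = lcs / len(ref)  # share of reference words that lie on the LCS
    f = 0.0 if lcs == 0 else (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
    return {"p": p, "r": r, "f": f}


scores = rouge_l("Installing collected packages", "Installing ")
print(scores["p"])  # 0.3333333333333333
```

The candidate has three words and only one of them ("Installing") lies on the LCS, which gives the precision of 1/3 printed by the demo cell.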

2. Loading and processing the dataset

In [4]

# Read the training dataset
# Read the text file line by line
def read_file(filename):
    lines = []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line has the form ID\tcontent\tsummary; split it once
            fields = line.split("\t")
            lines.append({"id": fields[0].strip(), "content": fields[1].strip(), "summary": fields[2].strip()})
    return lines

txt_path = 'data/data143475/train_dataset.csv'
lines = read_file(txt_path)
print("lines line count", len(lines))
for line in lines[:3]:
    print("-" * 30)
    print(line['id'], line['content'], line['summary'])

lines line count 9000
------------------------------
0 by . daily mail reporter . published : . 15:34 est , 13 july 2012 . | . updated : . 01:33 est , 16 july 2012 . kelsey grammer 's wife kayte has given birth to their first child together . the boss actor , 57 , and his 32-year-old spouse -- who were expecting twins -- are ` thrilled ' after welcoming a ` healthy baby girl ' weighing 6lbs 2oz into the world this morning in los angeles , and they have named her faith evangeline elisa grammer . but the couple revealed they tragically lost their unborn son shortly after announcing kayte was pregnant with twins . joy and heartache : kelsey grammer and kayte walsh , pictured in chicago esterday , have welcomed a baby girl , but also revealed they lost a twin boy during the pregnancy . in a personal note , they said : ` early . this morning kayte gave birth to faith evangeline elisa grammer . we . are thrilled . she was 6lbs 2oz when she entered the world at 1am on the . 13th of july in the year 2012 . mother and child are in excellent . health . ' ` we were ecstatic earlier this year , . when we announced that kayte was carrying twins . tragically we lost the . little boy shortly thereafter . this was not something we cared to make . known publicly at the time . ' ` it was unspeakably painful and we . know that people will understand our desire to keep the news private . then , as we know they will respect our privacy in this matter now . a . glorious birth with a lingering sadness is ours today . ` we choose to celebrate the life that has been given us ' : the pair released an emotional statement today . ` healthy baby girl ' : they have named the baby , who weighs 6lbs 2oz , faith evangeline elisa grammer . ` we choose to celebrate the life that . has been given us . we proudly introduce our faith to the world today . looking forward to the days ahead and the children yet to come . ' the couple -- who got married in . 
february 2011 and renewed their vows in june -- previously lost a child . when kayte suffered a miscarriage in 2010 . kelsey already has four kids , . spencer , 28 , and greer , 19 , from previous relationships and 10-year-old . mason and jude , seven , with ex-wife camille donatacci . the couple went public with their romance just weeks after he split from the real housewives of beverly hills star . ex wife : kelsey with real housewives star camille and their children jude and mason in 2008 . kayte gave birth to a ` healthy baby girl ' named faith evangeline elisa this morning . couple reveal ` unspeakable ' pain at losing twin boy during pregnancy . celebrating a ` glorious birth ' with ` lingering sadness '
------------------------------
1 by . daily mail reporter . published : . 00:04 est , 14 july 2012 . | . updated : . 01:30 est , 16 july 2012 . sylvester stallone was said to have almost collapsed with grief on learning of the death of his son yesterday . the body of sage stallone , 36 , was found by his housekeeper at his los angeles home . prescription drugs were reportedly found nearby but police said it was too early to say whether they were the cause of his death . tragedy : sylvester stallone 's son sage was found dead this afternoon in his los angeles apartment after a suspected drug overdose . he was 36 , pictured here in 2006 in hollywood . a source close to stallone said : . ` when he heard the news , sly was shocked , short of breath and almost . collapsed . he just went quiet before sobbing uncontrollably . he is a . wreck at the moment . ' sage 's aunt melanie hart told the mail on sunday : ` people are speculating that it was suicide but we really have no idea . ' there were unconfirmed reports that . sage , whose mother is stallone 's first wife sasha czack , had been dead . for four days before his body was found . a source told radaronline that medics . arrived on the scene at 3.05 pm this afternoon and spent around 25 . 
minutes trying to revive sage before his death was pronounced at the . scene . his body was taken straight to the coroner 's office - and the insider claims no suicide note was found . ' i suspect he had been dead for quite a while when he was discovered , ' the source told the website . ` usually medics will be at the scene . for around 45 minutes but they were out of there within half an . hour . ` there were a number of prescription bottles found at the scene but it did not appear to be suicide and no note was found . ' pronounced dead at the scene : the coroner 's van was spotted at sage 's home in los angeles along with news crews . unresponsive : the filmmaker 's body was taken straight to the coroner 's office - and not to the hospital . a 9-1-1 call was placed shortly . before 3pm and the caller said sage was n't breathing and indicated it . could be a drug overdose , radar reports . an autopsy is scheduled to take place in the next 48 hours . shortly after news of sage 's death , a . spokesman released a statement on behalf of his action hero father , 66 , . who was at the comic con film convention in san diego yesterday . ` sylvester stallone is devastated and . grief-stricken over the sudden loss of his son , ' the actor 's . spokesperson michelle bega said in the statement . ` his compassion and thoughts are with sage 's mother , sasha . ' sudden death : the body of the 36-year-old sage stallone was brought out to the coroner 's van in los angeles . devastated : sly 's agent released a statement saying he was ` grief-striken ' at the loss of his son . mystery : an autopsy is scheduled to take place in the next 48 hours to determine the cause of death . earlier : sly was at comic com yesterday evening . red carpet smiles : sage pictured in 1996 at the hollywood premiere of daylight with his father sylvester and his now-wife jennifer flavin . 
double act : sage appeared alongside his father in the 1990 movie rocky v , playing the role of rocky 's son robert balboa . ` he was a very talented and wonderful young man . his loss will be felt forever . ' police said they found the younger . stallone in the home while responding to a ` welfare check ' , however . sage 's lawyer george braunstein said he was found by a housekeeper . friends and acquaintances had become concerned because they had n't heard from sage in the past day . braunstein said the death came as a shock , telling the new york post this afternoon : ` he was in good spirits , and working . on all kinds of projects . ` he was planning on getting married . i am just devastated . he was an extremely wonderful , loving guy . this is a tragedy . ' before the heartbreak : stallone was pictured yesterday with arnold schwarzenegger at the comic con film convention in san diego . sage moonblood stallone was the . oldest of sylvester stallone 's children and co-starred with his father . in two films . he was the first of two sons stallone had with first wife . sasha czack . he made his acting debut in 1990 's . rocky v - he played his stallone 's onscreen son - and also appeared with . his father in 1996 's daylight . hand in hand : sylvester pictured back in 1982 with his first wife sasha czack , sage 's mother . also in 1996 , sage stallone and . veteran film editor bob murawski co-founded grindhouse releasing , a . company dedicated to preserving and promoting the b-movies and . exploitation films of the 1970s and 80s . he also directed the 2006 short vic , which screened at the palm springs film festival . braunstein said sage had frequent requests to work on films . ` he was a full of life filmmaker with . his whole future ahead of him , ' he said . ` he was just very up and . enthusiastic and positive . ' i think it was probably some sort of accident , ' he said of the death . 
braunstein added that sage stallone greatly admired his father but was working hard to make his own name in the film industry . ` he was very proud of his father and proud to be his father 's son , ' he said . stallone 's split from sage 's mother czack in 1985 after 11 years together . they also have a another son . seargeoh , 32 , who is autistic . stallone went on to wed model and actress brigitte . nielsen in beverly hills but they split just two . years later in a very public divorce . he married third wife , jennifer . flavin , in 1997 after an eight-year on-again , off-again relationship and . they have three daughters : sophia rose , 15 , sistine rose , 14 , and . scarlet rose , 10 . sage , who was raised by his mother following his parents ' divorce , felt distant from his father growing up , a theme which hit home as they were filming rocky v together . big boots to fill : sage said he always worried about living up to his father 's success , seen here together again in rocky v . ` when i was screaming , `` you never spent time with me ! you never spent time with my mother ! '' - that was true , ' he told people magazine in 1996 . ` i was looking into my father 's face and really saying that . ' but it proved a turning point for the father and son , who went on to form a close bond and they acted again together in the 1996 film daylight . ` between takes , sly and sage would roll around in the dirt like two puppies , ' the director rob cohen observed at the time . sage certainly felt the pressure of growing up with such a famous father and would worry that he would never match his success . ` i tell him , `` as long as you give it your best , that 's all that matters , '' his mother sasha said in that same year . sage went on to pursue a career behind the camera and shunned the wild hollywood party scene , preferring to watch horror zombie films instead . ` people call me a hermit , ' he said while promoting the film . ` but i 'm happy . 
' star ` devastated and grief-stricken ' over sudden loss of his eldest child . sage played the 66-year-old 's onscreen son in rocky v . an autopsy is scheduled to take place in the next 48 hours after filmmaker was found next to prescription drugs .
------------------------------
2 by . rob waugh . published : . 02:50 est , 15 may 2012 . | . updated : . 08:19 est , 15 may 2012 . it looks like a floating car seat - but honda 's robotic uni-cub unicycle lets ordinary people do what used to be the province of circus performers , and stay upright on just one wheel . you simply lean to ` drive ' , and to steer - and in a demonstration in tokyo this week , volunteers piloted it with ease . when not in use , it can be folded up into a tiny carry case - although its top speed , 3.7 mph , not far off walking speed , may mean it 's not enormously popular with commuters . # . honda unveiled the new device on tuesday , which allows the rider to control its speed , up to 3.7 mph per hour , and direction by shifting one 's own weight . so far , there is no release date for the robot unicycle , which has a top speed of 3.7 mph . the device was shown off by honda this week - as yet , there are no plans for a release date . the device is meant to be nimble enough that it can be used indoors . as well as cars and motorcycles , honda also has a long track record in robotics , with a humanoid robot , asimo , that is a regular at its stage shows . swaying your body from side to side is all you need to do to turn , rotate full circle and zip around on the uni-cub . the uni-cub has one main wheel , while a tiny wheel at the back helps for circular moves . reporters got a test ride on the machine tuesday . it takes some getting used to but responds smoothly and quietly . lean forward to go straight , to the left to go left . if all fails to stop , just put your foot down . uni-cub will be on display at a tokyo science museum . there are no plans yet for a commercial product . the single . 
wheel on the u3-x is made . up of many tiny motor-controlled wheels , packed inside the bigger . wheel , allowing the device to swerve in any direction . the u3-x weighs just under 22 pounds , runs . on a full charge for an hour , and has a lithium-ion battery . it is best suited to those over 5ft . mamoru mori , executive director of national museum of emerging science and innovation -lrb- miraikan -rrb- and former astronaut , rides honda motor co 's new uni-cub personal mobility device at the museum in tokyo . swaying your body from side to side is all you need to do to turn , rotate full circle and zip around on the uni-cub , which looks a bit like a floating car seat . former space shuttle endeavour mission specialist mamoru mohri demonstrates honda 's new robotics technology , uni-cub . it may look a little precarious and . uncomfortable to ride , but honda believe their new ` personal mobility ' device could one day be zipping up and down our streets . the . vehicle looks like a very modern unicycle and to ride it you simply . lean your weight in the direction you want to go , whether that 's . forward , backwards or even sideways . it maintains its own balance . travelling up to 3.7 mph . a slippery slope ? pixar film wall-e predicted humans would become too obese to walk in the future after relying on technology . the u3-x can be easily carried -lrb- left -rrb- . like the . segway -lrb- pictured with former japanese prime minister junichiro koizumi , . right -rrb- the segway the u3-x moves when you shift your weight . honda makes the asimo walking child-shaped . robot and the u3-x uses some of the same technology . last year , honda also unveiled a gadget that can . support a wearer 's bodyweight , made of mechanical frames attached to a . pair of shoes . japanese . rival toyota motors has shown machines that help people get . around , including the winglet , similar to the segway , a scooter-like . device that people ride standing up . 
japan is one of the most rapidly aging societies in the world , and concerns are growing about helping the elderly get around . ` honda . engineers are always thinking about people 's dreams and wishes about . mobility . we will continue to work hard to be a leader in that area , ' mr . ito said . tiny unicycle has top speed of 3.7 mph . can be folded up into carry case . like segway , robot ` brains ' keep rider upright . designed to be nimble enough to be used indoors .

3. Loading the T5 model

Note that you can also load a checkpoint you saved earlier and continue training from it.

In [5]

# Load with PaddleNLP
from paddlenlp.transformers import T5Tokenizer, T5ForConditionalGeneration

model_name_or_path = "t5-base"
# To resume training from a checkpoint, load a previously saved model instead
# model_name_or_path = "20240620244319_epoch_2_epoch_loss_1.02523"
tokenizer = T5Tokenizer.from_pretrained(model_name_or_path)
model = T5ForConditionalGeneration.from_pretrained(model_name_or_path)
print("t5-base model loaded")

W0621 08:32:36.855329   212 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0621 08:32:36.858129   212 gpu_context.cc:306] device: 0, cuDNN Version: 7.6.

t5-base model loaded

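4. Splitting the dataset and building the data_loader

Because T5 does not tokenize text character by character, numpy and tokenizer.tokenize are used here to measure the maximum and mean token lengths.

In [6]

# # The content is longer than the summary, so the max tokenized content length is used as the input length.
# all_content_list=[x['content'] for x in lines]
# all_content_len_list=[ len(tokenizer.tokenize(x)) for x in all_content_list]
# print("all_content_len_list",all_content_len_list)

In [7]

# # Use numpy to get the maximum tokenized length
# import numpy as np
# mean_len_content=np.mean(all_content_len_list)
# print("mean",np.mean(all_content_len_list))
# print("max",np.max(all_content_len_list))
# The max computed here is used as the token length below. Because this is T5, the input length is not limited; with other models you would probably need to extract the key sentences first (down to about 500 tokens) before feeding them to the model.
# mean 1202.3557777777778
# max 3655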
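The statistics in the commented cell above show a mean of about 1202 tokens against a maximum of 3655, so most examples are much shorter than the longest one. The sketch below shows one way to check how many examples a smaller length limit would truncate. It uses whitespace splitting as a stand-in for tokenizer.tokenize, and length_stats and truncated_fraction are hypothetical helpers that are not part of the original notebook.

```python
import statistics

# Hypothetical helpers for exploring sequence-length limits.
# Assumption: str.split stands in for tokenizer.tokenize, so the numbers
# are only comparable in spirit to the real T5 token counts.

def length_stats(texts, tokenize=str.split):
    """Return the sorted token lengths together with their mean and max."""
    lengths = sorted(len(tokenize(t)) for t in texts)
    return {
        "lengths": lengths,
        "mean": statistics.mean(lengths),
        "max": lengths[-1],
    }


def truncated_fraction(lengths, limit):
    """Fraction of examples that are longer than the given limit."""
    return sum(1 for n in lengths if n > limit) / len(lengths)


texts = ["a b c", "a", "a b c d e"]
stats = length_stats(texts)
print("mean", stats["mean"], "max", stats["max"])
print("truncated at 3:", truncated_fraction(stats["lengths"], 3))
```

Passing tokenize=tokenizer.tokenize would give counts on the model's own vocabulary. A limit below the maximum trades some truncated articles for less padding per example; the original notebook does not make that trade and pads everything to 3800.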

In [ ]

# Dataset processing
import paddle
from functools import partial
from paddlenlp.data import Tuple, Pad
# Load the dataset
from paddlenlp.datasets import MapDataset

# Free cached GPU memory
paddle.device.cuda.empty_cache()
# Batch size
batch_size = 1
# Maximum sequence length; this is the input length (slightly above the measured max of 3655)
max_seq_length = 3800
# Use the first 90% of the records for training and the rest for validation
split_num = int(len(lines) * 0.9)
train_ds = MapDataset(lines[:split_num])
dev_ds = MapDataset(lines[split_num:])

def convert_example(example, tokenizer, max_seq_length=1024):
    source = tokenizer(
        example["content"],
        max_seq_len=max_seq_length,
        pad_to_max_seq_len=True,
        return_token_type_ids=False,
        return_attention_mask=True, )
    target = tokenizer(
        example["summary"],
        max_seq_len=max_seq_length,
        pad_to_max_seq_len=True,
        return_token_type_ids=False,
        return_attention_mask=True, )
    return (
        source["input_ids"],
        source["attention_mask"],
        target["input_ids"],
        target["attention_mask"], )

# Convert the data into the format the model reads
train_ds.map(partial(convert_example, tokenizer=tokenizer, max_seq_length=max_seq_length))
dev_ds.map(partial(convert_example, tokenizer=tokenizer, max_seq_length=max_seq_length))

batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # attention_mask
    Pad(axis=0, pad_val=-100, dtype="int64"),  # lm_labels
    Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # decoder_attention_mask
): fn(samples)

# shuffle (bool): whether to shuffle the sample indices. Defaults to False.
batch_sampler = paddle.io.BatchSampler(dataset=train_ds, batch_size=batch_size, shuffle=True)
# DataLoader returns an iterator that walks the dataset once in the order given by batch_sampler.
# With return_list=True, the data returned on each device is a list(Tensor).
train_data_loader = paddle.io.DataLoader(dataset=train_ds, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True)
print("train_ds loaded, dataset size:", len(train_ds))
print("dev_ds loaded, dataset size:", len(dev_ds))
print("train_ds[0]", train_ds[0])
print("dev_ds[0]", dev_ds[0])

In [7]

# Sort a directory's entries by time and keep only its newest N folders or files
import os
import shutil

def rm_backup_keep_num(rm_path, keep_number=10000, show_log=False):
    if not os.path.exists(rm_path):
        return
    # Map each full path to its ctime
    path_ctime = {}
    for name in os.listdir(rm_path):
        all_path = os.path.join(rm_path, name)
        path_ctime[all_path] = os.path.getctime(all_path)
    # Sort the (path, ctime) pairs by ctime, oldest first
    all_path_ctime_list = sorted(path_ctime.items(), key=lambda item: item[1])
    if keep_number > 0:
        need_rm_list = all_path_ctime_list[:-1 * keep_number]
    else:
        need_rm_list = all_path_ctime_list
    for path, _ in need_rm_list:
        if path.find(".ipynb_checkpoints") > -1:
            continue
        if show_log:
            print("Deleting file or folder:", path)
        # Delete according to whether the entry is a file or a folder
        if os.path.isfile(path):
            os.remove(path)
        else:
            shutil.rmtree(path)

rm_backup_keep_num('./ckpt_model', 10)

5. A model evaluation function

It returns the ROUGE-L score, which is used to decide which model to save as the best one.

In [17]

# Evaluation function: returns the ROUGE-L score, used to keep the best model for prediction.
# This training competition is ranked by ROUGE-L, so held-out data (dev_ds) is used here to keep the model with the best ROUGE-L.
# PaddleNLP Metrics API    paddlenlp.metrics.RougeL
# https://paddlenlp.readthedocs.io/zh/latest/metrics/metrics.html
# from paddlenlp.metrics import RougeL
# rougel = RougeL()
# cand = ["The","cat","The","cat","on","the","mat"]
# ref_list = [["The","cat","is","on","the","mat"], ["There","is","a","cat","on","the","mat"]]
# rougel.add_inst(cand, ref_list)
# print(rougel.score()) # 0.7800511508951408
from rouge import Rouge
import numpy as np
from paddlenlp.metrics import RougeL

def test_dev(model, tokenizer):
    model.eval()
    score_list = []
    rouge_score_list = []
    i = 0
    for dev_item in dev_ds:
        i = i + 1
        # Evaluate at most 100 records
        if i > 100:
            break
        source_ids, source_mask, labels, target_mask = dev_item
        input_ids = paddle.to_tensor([source_ids], dtype="int64")
        outputs = model.generate(
            input_ids,
            num_beams=5,  # number of beams kept at each time step of beam_search
            # num_beam_groups=1,  # number of groups to return, for diversity
            min_length=20,  # minimum length of the generated sequence; the default is 0
            max_length=500,  # maximum length of the generated sequence
            eos_token_id=tokenizer.eos_token_id,
            temperature=1.0,
            top_k=5,
            top_p=1.0,
            repetition_penalty=10.0,
            length_penalty=1.0,
            decode_strategy="beam_search",  # decoding strategy; three are currently supported: "greedy_search", "sampling" and "beam_search". The default is "greedy_search".
            num_return_sequences=1,
            # use_faster=False,  # use the faster mode if it is available
        )
        out_str_list = []
        output_ids_list = outputs[0]
        output_score_list = outputs[1]
        # print("output_ids_list",output_ids_list)
        # print("output_score_list",output_score_list)
        for index, output_item in enumerate(zip(output_ids_list, output_score_list)):
            output_ids = output_item[0]
            output_score = output_item[1]
            # print("output_score",output_score.item())
            # skip_special_tokens=True drops special tokens
            out_str = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
            # print("output_score_list[index]",output_score_list[index][0])
            # score = output_score_list[index][0]
            score = output_score.item()
            print("predicted out_str:", out_str)
            # out_str_list.append("["+str(score)+"]"+out_str)
            out_str_list.append(out_str)
            # out_str_list.append(out_str.replace(" ",""))
            print("out_str_list", out_str_list)
        # print("source_ids",tokenizer.decode(source_ids,skip_special_tokens=True))
        # print("labels",tokenizer.decode(labels,skip_special_tokens=True))
        label_str = tokenizer.decode(labels, skip_special_tokens=True)
        # Score with the third-party rouge package; label_str and out_str_list[0] are both strings
        rouge = Rouge()
        rouge_scores = rouge.get_scores(out_str_list[0], label_str)
        # 'p' is the precision component of ROUGE-L
        rl_p = rouge_scores[0]['rouge-l']['p']
        rouge_score_list.append(rl_p)
        # Score with the PaddleNLP RougeL API
        rougel = RougeL()
        cand = out_str_list[0].split(" ")
        ref_list = [label_str.split(" ")]
        # Demo of the PaddleNLP RougeL API
        # print("cand",cand)
        # print("ref_list",ref_list)
        # cand = ["The","cat","The","cat","on","the","mat"]
        # ref_list = [["The","cat","is","on","the","mat"], ["There","is","a","cat","on","the","mat"]]
        rougel.add_inst(cand, ref_list)
        score = rougel.score()
        score_list.append(score)
        # print("----"*2)
    print(f"{len(score_list)} dev predictions,", "PaddleNLP RougeL score:", np.mean(score_list))
    print(f"{len(score_list)} dev predictions,", "third-party rouge-l score:", np.mean(rouge_score_list))
    model.train()
    return np.mean(score_list)

# test_dev(model, tokenizer)

6. Training the model and automatically saving the best one

I trained for about 2 hours on a V100.

In [ ]

# Train the model. You can set how often to log, and how often to evaluate and save the best model.
import time

# Number of training epochs
epochs = 30
# Start training
global_step = 0
every_log_step = 20
every_save_epoch = 1
every_save_step = 5000
keep_ckpt_best_num = 2  # keep the top N best models
keep_ckpt_num = 3  # keep the N most recent checkpoints
# Folders where model parameters are saved during training
ckpt_dir = "./ckpt_model"
ckpt_best_dir = "./ckpt_best_model"
# Keep only the newest N folders
rm_backup_keep_num(ckpt_dir, keep_ckpt_num)
# len(train_data_loader) is the number of steps in one epoch
num_training_steps = len(train_data_loader) * epochs
# AdamW optimizer
optimizer = paddle.optimizer.AdamW(learning_rate=1e-5, parameters=model.parameters())
# Cross-entropy loss function
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy metric
metric = paddle.metric.Accuracy()
# Running epoch loss and the loss value recorded at the last log
epoch_all_loss, logging_loss = 0.0, 0.0
last_model_score = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    epoch_all_loss = 0.0
    logging_loss = 0.0
    save_loss = 0.0
    model.train()
    for step, batch in enumerate(train_data_loader):
        # Training mode
        model.train()
        global_step += 1
        source_ids, source_mask, labels, target_mask = batch
        outputs = model(
            input_ids=source_ids,
            attention_mask=source_mask,
            labels=labels,
            decoder_attention_mask=target_mask, )
        # The loss is the first output
        loss = outputs[0]
        epoch_all_loss += loss.item()
        # Log the mean loss over the last every_log_step steps
        if global_step % every_log_step == 0:
            print("global_step:%s, every_log_step:%s every_log_step_loss:%.5f  " % (global_step, every_log_step, (epoch_all_loss - logging_loss) / every_log_step))
            logging_loss = epoch_all_loss
        # Backpropagate the gradients and update the parameters
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % every_save_step == 0:
            # Loss accumulated since the last save, computed once so both folder names use the same value
            # (the original reset save_loss before naming the regular checkpoint, which made that loss 0)
            step_loss = (epoch_all_loss - save_loss) / every_save_step
            save_loss = epoch_all_loss
            model_score = test_dev(model, tokenizer)
            if model_score >= last_model_score:
                last_model_score = model_score
                print("Saving best model")
                time_str = time.strftime('%Y%m%d%H%M%S', time.localtime())
                save_dir = os.path.join(ckpt_best_dir, "%s_step_%d_score_%0.5f_loss_%.5f" % (time_str, global_step, model_score, step_loss))
                if not os.path.exists(save_dir):
                    os.makedirs(save_dir)
                # Save the current model parameters
                model.save_pretrained(save_dir)
                # Save the tokenizer vocabulary
                tokenizer.save_pretrained(save_dir)
                # Keep only the newest N folders
                rm_backup_keep_num(ckpt_best_dir, keep_ckpt_best_num)
            time_str = time.strftime('%Y%m%d%H%M%S', time.localtime())
            save_dir = os.path.join(ckpt_dir, "%s_step_%d_loss_%.5f" % (time_str, global_step, step_loss))
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # Save the current model parameters
            model.save_pretrained(save_dir)
            # Save the tokenizer vocabulary
            tokenizer.save_pretrained(save_dir)
            # Keep only the newest N folders
            rm_backup_keep_num(ckpt_dir, keep_ckpt_num)
            del time_str, save_dir, outputs, loss
            paddle.device.cuda.empty_cache()
    epoch_loss = epoch_all_loss / len(train_data_loader)
    print("-" * 30)
    print("epoch", epoch, "epoch_loss", epoch_loss)
    print("-" * 30)
    if epoch % every_save_epoch == 0:
        # Save the regular per-epoch model
        time_str = time.strftime('%Y%m%d%H%M%S', time.localtime())
        save_dir = os.path.join(ckpt_dir, "%s_epoch_%d_epoch_loss_%.5f" % (time_str, epoch, epoch_loss))
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        # Save the current model parameters
        model.save_pretrained(save_dir)
        # Save the tokenizer vocabulary
        tokenizer.save_pretrained(save_dir)
        # Keep only the newest N folders
        rm_backup_keep_num(ckpt_dir, keep_ckpt_num)
        del time_str, save_dir, epoch_loss
        paddle.device.cuda.empty_cache()
print("Training finished")

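7. Predicting on the test set and saving the results

The predictions are saved to submit.csv.

Prediction is slow. As of 20240621, PaddleNLP 2.3.3 does not yet support use_faster mode for the T5 model; hopefully support arrives soon.

Predicting one record at a time, the 1000-record test set is expected to take 20+ minutes. With use_faster support this could be 10x+ faster. Another option for speeding things up is to group inputs into batches before passing them to the model.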
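The batching idea mentioned above can be sketched in plain Python: group the tokenized inputs into batches and pad each batch only to its own longest sequence. This is an illustrative sketch, not the original author's code; make_batches and pad_batch are hypothetical names, and pad_id=0 is an assumption that should be replaced by tokenizer.pad_token_id.

```python
# Hypothetical batching helpers; names and defaults are illustrative only.

def make_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def pad_batch(id_lists, pad_id=0):
    """Pad token-id lists to the longest one and build attention masks."""
    width = max(len(ids) for ids in id_lists)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in id_lists]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in id_lists]
    return input_ids, attention_mask


tokenized = [[5, 6, 7], [8], [9, 10], [11, 12, 13, 14], [15]]
for batch in make_batches(tokenized, batch_size=2):
    ids, mask = pad_batch(batch)
    print(ids, mask)
```

Each padded batch could then be converted with paddle.to_tensor and passed to model.generate in place of the single-record input. Whether generate accepts the attention mask in the PaddleNLP version in use would need to be checked, and sorting inputs by length before batching would reduce padding further.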

In [ ]

# yuce ("predict"): generate a summary for one article
def yuce(in_content_str):
    model.eval()
    source = tokenizer(
        in_content_str,
        # max_seq_len=max_seq_length,  # left unset because a single record is processed at a time
        pad_to_max_seq_len=True,
        return_token_type_ids=False,
        return_attention_mask=True, )
    # print("source", source)
    source_ids = source['input_ids']
    source_mask = source['attention_mask']
    input_ids = paddle.to_tensor([source_ids], dtype="int64")
    outputs = model.generate(
        input_ids,
        num_beams=5,  # number of beams kept at each time step of beam_search
        # num_beam_groups=1,  # number of groups to return, for diversity
        min_length=40,  # minimum length of the generated sequence; the default is 0
        max_length=500,  # maximum length of the generated sequence
        eos_token_id=tokenizer.eos_token_id,
        temperature=1.0,
        top_k=5,
        top_p=1.0,
        repetition_penalty=10.0,
        length_penalty=1.0,
        decode_strategy="beam_search",  # decoding strategy; three are currently supported: "greedy_search", "sampling" and "beam_search". The default is "greedy_search".
        num_return_sequences=1,
        use_faster=True,  # use the faster mode if it is available
    )
    out_str_list = []
    output_ids_list = outputs[0]
    output_score_list = outputs[1]
    for index, output_item in enumerate(zip(output_ids_list, output_score_list)):
        # The highest-scoring output could be selected via output_score; to keep things simple, out_str_list[0] is returned directly
        output_ids = output_item[0]
        output_score = output_item[1]
        # skip_special_tokens=True drops special tokens
        out_str = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
        score = output_score.item()
        print("predicted out_str:", out_str)
        out_str_list.append(out_str)
        print("out_str_list", out_str_list)
    return out_str_list[0]

# Read the text file line by line
def read_test_file(filename):
    lines = []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line has the form ID\tcontent; split it once
            fields = line.split("\t")
            lines.append({"id": fields[0].strip(), "content": fields[1].strip()})
    return lines

txt_path = 'data/data143475/test_dataset.csv'
test_lines = read_test_file(txt_path)
print("test_lines line count", len(test_lines))
# for line in test_lines[:3]:
#     print("-"*30)
#     print(line['id'],line['content'])
out_csv_list = []
for line in test_lines:
    test_id = line['id']
    content = line['content']
    summary = yuce(content)
    out_csv_list.append(test_id + "\t" + summary + "\n")
    print(test_id, len(test_lines))
print("-" * 20)
print("test len", len(test_lines))
print("out_csv_list len", len(out_csv_list))

test_lines line count 1000
predicted out_str: mentally ill inmates at miami-dade pretrial detention facility are housed on " forgotten floor " judge steven leifman says about one-third of all people in jails are mentally ill. starting in 2008, many will be sent to a new mental health facility.
out_str_list ['mentally ill inmates at miami-dade pretrial detention facility are housed on " forgotten floor " judge steven leifman says about one-third of all people in jails are mentally ill. starting in 2008, many will be sent to a new mental health facility.']
0 1000
predicted out_str: harry potter star daniel radcliffe turns 18 on monday. the young actor has access to a reported # 20 million fortune. at 18, he will be able to gamble in a casino or buy a drink in a pub. his earnings from the first five potter films have been held in a trust fund.
out_str_list ['harry potter star daniel radcliffe turns 18 on monday. the young actor has access to a reported # 20 million fortune. at 18, he will be able to gamble in a casino or buy a drink in a pub. his earnings from the first five potter films have been held in a trust fund.']
1 1000
predicted out_str: driver recalls 30 -, 35-foot free fall on mississippi bridge. " it just gave way, and it just fell completely, all the way to the ground, " survivor says. rescue effort was organized opposite of lightning-quick collapse, doctor says.
out_str_list ['driver recalls 30 -, 35-foot free fall on mississippi bridge. " it just gave way, and it just fell completely, all the way to the ground, " survivor says. rescue effort was organized opposite of lightning-quick collapse, doctor says.']
2 1000
predicted out_str: 5-year-old youssif was set on fire outside his california home in january. parents say they put themselves in incredible danger by trying to help him. the children's burn foundation agrees to pay for transportation and medical expenses.
out_str_list ["5-year-old youssif was set on fire outside his california home in january. parents say they put themselves in incredible danger by trying to help him. the children's burn foundation agrees to pay for transportation and medical expenses."]
3 1000
predicted out_str: five small polyps were removed from president bush's colon on saturday. all were small, less than a centimeter -lsb- half an inch -rsb- in diameter. vice president dick cheney assumed presidential power during procedure. afterward the president played with his scottish terriers, barney and miss beazley.
out_str_list ["five small polyps were removed from president bush's colon on saturday. all were small, less than a centimeter -lsb- half an inch -rsb- in diameter. vice president dick cheney assumed presidential power during procedure. afterward the president played with his scottish terriers, barney and miss beazley."]
4 1000

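In [ ]

# print("out_csv_list",out_csv_list)
# Write all predictions to submit.csv, one ID\tsummary record per line
with open("submit.csv", 'w', encoding='utf-8') as file:
    file.write("".join(out_csv_list))
print("Please use submit.csv for evaluation.")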